
PCIe as a chip-to-chip interconnect


Yousuf Khan

unread,
Oct 12, 2004, 4:18:31 PM10/12/04
to
It was originally thought that PCI-Express was intended as a competitor to
AMD's Hypertransport as a chip-to-chip interconnect technology. Is this
still thought of as the case or have they decided that PCIe is simply a
peripheral interconnect, much like PCI was before it?

Yousuf Khan

--
Humans: contact me at ykhan at rogers dot com
Spambots: just reply to this email address ;-)


Peter Boyle

unread,
Oct 12, 2004, 8:32:07 PM10/12/04
to

On Tue, 12 Oct 2004, Yousuf Khan wrote:

> It was originally thought that PCI-Express was intended as a competitor to
> AMD's Hypertransport as a chip-to-chip interconnect technology. Is this
> still thought of as the case or have they decided that PCIe is simply a
> peripheral interconnect, much like PCI was before it?

Since peripheral interconnects still tend to connect chips to chips,
I guess you're meaning for implementing coherent SMP between CPU's?

While I don't know about that, PCIe has 7m cable specs, which
makes it a candidate for point-to-point _inter_board connectivity, unlike
HT.

Peter

> Yousuf Khan
>
> --
> Humans: contact me at ykhan at rogers dot com
> Spambots: just reply to this email address ;-)
>
>
>

Peter Boyle pbo...@physics.gla.ac.uk


del cecchi

unread,
Oct 12, 2004, 9:42:03 PM10/12/04
to

"Peter Boyle" <pbo...@holyrood.ed.ac.uk> wrote in message
news:Pine.GSO.4.58.04...@holyrood.ed.ac.uk...

>
> On Tue, 12 Oct 2004, Yousuf Khan wrote:
>
> > It was originally thought that PCI-Express was intended as a competitor to
> > AMD's Hypertransport as a chip-to-chip interconnect technology. Is this
> > still thought of as the case or have they decided that PCIe is simply a
> > peripheral interconnect, much like PCI was before it?
>
> Since peripheral interconnects still tend to connect chips to chips,
> I guess you're meaning for implementing coherent SMP between CPU's?
>
> While I don't know about that, PCIe has 7m cable specs, which
> makes it a candidate for point-to-point _inter_board connectivity, unlike
> HT.
>
> Peter
>
Strictly speaking, HT as documented in the HT specification at
www.hypertransport.org isn't coherent either. The coherence is AMD
proprietary extension.

It sure looks like PCI-E is intended more for I/O, but it could be used
for chip to chip as well I would think. I guess one could read both
protocol specifications and judge for oneself, should one be into such
things. What are the characteristics that would distinguish a "chip to
chip" technology from a peripherial interconnect? (says the circuit
designer)

del cecchi

Yousuf Khan

unread,
Oct 13, 2004, 3:28:17 AM10/13/04
to
Peter Boyle wrote:
> A peripheral interconnect ultimately connects chips to chips, so all I
> could think of was that the OP meant CPU-CPU interconnect, which is
> why I raised cHT.

Well no, not necessarily just CPU-CPU connectivity. HT is also used to
connect let's say the northbridge to the southbridge of a chipset.
Basically, what I'm talking about is connecting many different kinds of
chips within a single system board. Although PCIe would ultimately be used
to connect chips to chips, I was really referring to more direct
connections between chips, as opposed to a lot of glue logic in between the chips.

Yousuf Khan


Kai Harrekilde-Petersen

unread,
Oct 13, 2004, 4:08:32 AM10/13/04
to
"Yousuf Khan" <bbb...@ezrs.com> writes:

As far as I've understood PCIe, it's limited to strictly tree-like
topologies (i.e. two endpoints cannot communicate directly with each other).

This is one of the limitations that the PCI-AS folks are trying to
remove.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Stephen Fuld

unread,
Oct 14, 2004, 12:26:33 PM10/14/04
to

"del cecchi" <dcecchi...@att.net> wrote in message
news:2t62beF...@uni-berlin.de...
>
> "Yousuf Khan" <bbb...@ezrs.com> wrote in message
> news:X7KdnVFGSbs...@rogers.com...

>> Kai Harrekilde-Petersen wrote:
>> > As far as I've understood on PCIe, it's limited to strictly tree-like
>> > topologies (ie two endpoint cannot communicate together directly).
>> >
>> > This is one of the limitations what the PCI-AS folks are trying to
>> > remove.
>>
>> Well, that's basically what I was trying to find out, whether PCIe is
>> practical as a chip-to-chip connector yet. I would assume that the amount of
>> design work needed to implement interconnects using PCIe would be much more
>> complicated than Hypertransport, right?
>
> Sorry, missed this question. I think they are both about the same
> amount of work, unless you have buddies at AMD who will give you their
> coherent HT design logic. Otherwise, architecturally the two are sort
> of similar. I think they are both designed like PCI. Or you could use
> IB. It's not coherent either, but it has the other attributes you need.
> (plug). Or you could just buy SGI and sell the parts you don't need.
> They have a quite nice NUMA SAN and they can be had for only 365 million
> dollars. :-) Think of it as 1500 Person years, and you get installed
> base, people, and real estate thrown in. Of course you have to add
> another 100 Million for the debt. :-)

Some years ago, the company I worked for got pitched by NCR on buying what
they called their System Area Net. It may still be available.

--
- Stephen Fuld
e-mail address disguised to prevent spam


Kai Harrekilde-Petersen

unread,
Oct 13, 2004, 4:09:54 PM10/13/04
to
"Yousuf Khan" <bbb...@ezrs.com> writes:

> Kai Harrekilde-Petersen wrote:
>> As far as I've understood on PCIe, it's limited to strictly tree-like
>> topologies (ie two endpoint cannot communicate together directly).
>>
>> This is one of the limitations what the PCI-AS folks are trying to
>> remove.
>

> Well, that's basically what I was trying to find out, whether PCIe is
> practical as a chip-to-chip connector yet. I would assume that the amount of
> design work needed to implement interconnects using PCIe would be much more
> complicated than Hypertransport, right?

Your guess is as good as mine. I hardly know anything about PCIe, and
even less about HT. The little I know about PCIe comes from attending
the PCI-AS / PI-Ethernet conference calls.

Del Cecchi

unread,
Oct 14, 2004, 2:30:27 PM10/14/04
to

"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:ZMxbd.548123$OB3.3...@bgtnsc05-news.ops.worldnet.att.net...

IBM would probably sell a NUMA interconnect network or two also. But SGI
seemed like such a deal. :-)

del cecchi
>
>


Del Cecchi

unread,
Oct 13, 2004, 3:29:22 PM10/13/04
to

"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:X7KdnVFGSbs...@rogers.com...
> Kai Harrekilde-Petersen wrote:
> > As far as I've understood on PCIe, it's limited to strictly tree-like
> > topologies (ie two endpoint cannot communicate together directly).
> >
> > This is one of the limitations what the PCI-AS folks are trying to
> > remove.
>
> Well, that's basically what I was trying to find out, whether PCIe is
> practical as a chip-to-chip connector yet. I would assume that the amount of
> design work needed to implement interconnects using PCIe would be much more
> complicated than Hypertransport, right?
>
> I guess this brings me to a corollary to my original question. Is PCIe an
> appropriate interconnect for connecting CPUs to each other like
> coherent-HT? Has it got the appropriate features needed for connecting multi
> CPUs (eg. cache coherency, latency, etc.)? Or is PCIe simply a peripheral
> interconnect and that's all we should look upon it as?
>
> Yousuf Khan
>
It is not a cache coherent thing, more like serial PCI. So it isn't a NUMA
appropriate interconnect. Aren't all NUMA interconnects proprietary? I
can't think of one that isn't. (well, SCI isn't, but I don't know anyone
using it anymore without modifications)(but I could easily be mistaken)

del cecchi


del cecchi

unread,
Oct 15, 2004, 10:06:41 PM10/15/04
to

"TC" <no...@noplace.com> wrote in message
news:vm%bd.5847$NX5....@newsread3.news.atl.earthlink.net...
> PCI Express (PCIe) is definitely not suited for CPU-to-CPU communication. It
> preserves the legacy PCI model where a CPU sits at the top of a flat global
> address space that is partitioned across a hierarchy of bus segments by
> bridges (i.e. a tree topology). PCIe packets (i.e. PCI transactions) are
> routed up or down the tree topology using address ranges. Like PCI, the CPU
> at the top of the PCIe hierarchy is the origin of global resets, and the
> default (and only) place that certain interrupts and events can be routed
> to. There are no provisions for a cache coherency protocol, nor are there
> provisions for CPU-to-CPU communication of any sort. PCI, and thus PCIe CPUs
> (and their chipsets) assume that they own all devices in the topology and
> that they configure their resources. These should all be familiar issues to
> designers of PCI systems. My view is that PCIe is simply the transition of
> PCI from a bus-based signaling technology to a point-to-point serial
> signaling technology and not much more than that. PCI Express is defined by
> the PCI SIG (www.pcisig.com).
>
> The Advanced Switching Interconnect (ASI) is defined by the ASI SIG
> (www.asi-sig.org). PCIe and ASI share a common physical and link layer
> protocol. ASI was defined to provide more advanced capabilities than PCIe
> but also to be complementary to PCIe. In fact, ASI supports the transparent
> tunneling of PCIe so that you can have a PCIe CPU and a PCIe peripheral
> interconnected by ASI and maintain strict hardware and software
> compatibility (including plug-and-play). Virtually all ASI-based systems
> will have both ASI and PCIe devices in them. ASI overcomes the legacy
> limitations of PCI (and PCIe) by providing an advanced routing model, adding
> support for multicast, a rich CoS/QoS capability, advanced event signaling,
> provisions for High-Availability, and more. ASI is a true switched
> interconnect capable of transporting packets that encapsulate virtually any
> protocol. Switches do not care about the contents of the packet, they only
> care about the routing of packets. Currently, there is no cache-coherence
> protocol defined but there are defined protocols for queue based messaging
> as well as very low overhead data transport (RDMA like) for CPU-to-CPU
> communication.
>
> Bottom line... if you are trying to build a stand-alone, single CPU system
> with traditional IO devices (no peer-to-peer communication) then PCIe is a
> great choice. If you are trying to build a more advanced system that has
> multiple CPUs, peer-to-peer communication, and other more advanced
> capabilities the combination of ASI and PCIe is the way to go.
>
> Hope this helps.
>
> TC
>
snip

Way cool. The AS guys just reinvented InfiniBand. They had to
rearrange their faces and give them all another name, as someone once
said. I wonder if these are the same guys that were bitching about IB
being too complex.

del cecchi
>


TC

unread,
Oct 15, 2004, 10:23:07 PM10/15/04
to
While it may seem simple for those who haven't studied ASI to compare it to
Infiniband, it is not a simple comparison to make. For one thing, compatibility
with the legacy infrastructure was not an afterthought. I would suggest that these
kinds of comparisons be left to those who have intimate knowledge of both ASI and
Infiniband.

TC

"del cecchi" <dcecchi...@att.net> wrote in message

news:2tbeebF...@uni-berlin.de...

Peter Boyle

unread,
Oct 16, 2004, 12:33:52 AM10/16/04
to

On Sat, 16 Oct 2004, TC wrote:

> While it is simple for those that haven't studied ASI to compare it to
> Infiniband, it is not. For one thing, compatibility to the legacy
> infrastructure was not an afterthought. I would suggest that these kind of
> comparisons be left to those who have intimate knowledge of both ASI and
> Infiniband.

Pointless post.

Either educate us with the facts behind your statements, you pompous git,
or shut up and don't bother posting.

Peter

Tony Nelson

unread,
Oct 16, 2004, 6:11:52 PM10/16/04
to
In article <Pine.GSO.4.58.04...@holyrood.ed.ac.uk>,
Peter Boyle <pbo...@holyrood.ed.ac.uk> wrote:

You could read the specs yourself. Doing that would seem to be a
prerequisite to having "intimate knowledge".

TC -- what you posted was useful and interesting. More might also be
useful and interesting.
____________________________________________________________________
TonyN.:' tony...@shore.net
'

Nick Maclaren

unread,
Oct 17, 2004, 6:12:50 AM10/17/04
to
In article <tonynlsn-C09E26...@news.primus.ca>,

Tony Nelson <tony...@shore.net> wrote:
>> On Sat, 16 Oct 2004, TC wrote:
>>
>> > While it is simple for those that haven't studied ASI to compare it to
>> > Infiniband, it is not. For one thing, compatibility to the legacy
>> > infrastructure was not an afterthought. I would suggest that these kind of
>> > comparisons be left to those who have intimate knowledge of both ASI and
>> > Infiniband.
>
>You could read the specs yourself. Doing that would seem to be a
>prerequisite to having "intimate knowledge".

I have heavily snipped this, because I want to address the non-flame
aspects.

Reading a specification is time-consuming and is roughly O(N).
Studying one is worse, and is at least O(N^2) unless the specification
is unusually orthogonal, but doing so is a prerequisite to getting
"intimate knowledge" of one. In practice, less than 1% of "experts"
have anything approaching intimate knowledge of more than one or
two aspects (often no more than 5-10%) of a complex specification.

I haven't even glanced at ASI, but the Infiniband specification is
some 2,000 pages, and I have reason to believe that there were (and
probably still are) no experts on it, as a whole. Yes, I do mean just
that - NONE. Part of the reason for that is that such experts also
need a good knowledge of the standards and implementations for the
environment in which it will be used (e.g. POSIX, Unix-like systems,
MPI, TCP/IP etc.)

People may remember me saying a LONG time back that InfiniBand's claims
were effectively undeliverable, why, and what subset of its objectives
might be deliverable. And that I was flamed from two sides, neither
of which appeared to have the first idea of the assumptions that the
other side was making. And I should like to point out that currently
planned releases are almost exactly as I described :-)

Please note - I am NOT, repeat NOT, an InfiniBand expert. I merely
read SOME of the specification and spotted a few aspects where it had
implicitly made assumptions that were not so. I then used my mainframe
experience, knowledge of relevant standards and implementations, and
deductive capacities to make my predictions.

If there HAD been any REAL experts with "intimate knowledge" of the
whole of InfiniBand, they would have spotted that at the design stage
and either modified the specification to alleviate the problem, or
modified the claims.

I doubt that ASI is much different, and I am certain that there is
nobody who is promiscuously intimate with InfiniBand, ASI and all of
the required contextual standards.


Regards,
Nick Maclaren.

Yousuf Khan

unread,
Oct 17, 2004, 1:15:51 PM10/17/04
to
TC <no...@noplace.com> wrote:
> PCI Express (PCIe) is definitely not suited for CPU-to-CPU
> communication. It preserves the legacy PCI model where a CPU sits at
> the top of a flat global address space that is partitioned across a
> hierarchy of bus segments by bridges (i.e. a tree topology). PCIe
> packets (i.e. PCI transactions) are routed up or down the tree
> topology using address ranges. Like PCI, the CPU at the top of the
> PCIe hierarchy is the origin of global resets, and the default (and
> only) place that certain interrupts and events can be routed to.
> There are no provisions for a cache coherency protocol, nor are there
> provisions for CPU-to-CPU communication of any sort. PCI, and thus
> PCIe CPUs (and their chipsets) assume that they own all devices in
> the topology and that they configure their resources. These should
> all be familiar issues to designers of PCI systems. My view is that
> PCIe is simply the transition of PCI from a bus-based signaling
> technology to a point-to-point serial signaling technology and not
> much more than that. PCI Express is defined by the PCI SIG
> (www.pcisig.com).

Okay, thanks, that's what I was looking for.

> The Advanced Switching Interconnect (ASI) is defined by the ASI SIG
> (www.asi-sig.org). PCIe and ASI share a common physical and link layer
> protocol. ASI was defined to provide more advanced capabilities than
> PCIe but also to be complementary to PCIe. In fact, ASI supports the
> transparent tunneling of PCIe so that you can have a PCIe CPU and a
> PCIe peripheral interconnected by ASI and maintain strict hardware
> and software compatibility (including plug-and-play).

This is the first time I'm hearing about ASI. When was the group formed? Was
this group known as something else previously?

So it basically looks like they're using the wires and connectors of PCIe to
create a switched fabric. Judging by the memberlist, both Intel and AMD are
involved, although Intel is a board member, while AMD is only a sponsoring
member. So I assume that Intel started this group as a way to look into
turning its own PCIe infrastructure into a CPU switching fabric? While it
seems like AMD is tagging along just to know how the competition is coming
along?

Yousuf Khan
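
To make the address-range routing TC describes concrete, here is a minimal
sketch in C. It is purely illustrative: the types and field names are made up,
and nothing here is taken from the PCI or PCIe specifications. Each bridge
claims one window of the flat global address space for its subtree, so routing
a memory transaction is just a range comparison.

#include <stdint.h>

/* Hypothetical model of a PCI/PCIe bridge: everything below the bridge
 * lives inside one contiguous window of the flat global address space. */
struct bridge {
    uint64_t window_base;   /* base of the downstream memory window  */
    uint64_t window_limit;  /* inclusive limit of that window        */
};

enum route { ROUTE_DOWNSTREAM, ROUTE_UPSTREAM };

/* Route a memory read/write by address range, the way PCI bridges do:
 * addresses inside the window go down the tree; everything else is
 * forwarded back up toward the root complex (and ultimately the CPU). */
static enum route route_mem_txn(const struct bridge *br, uint64_t addr)
{
    if (addr >= br->window_base && addr <= br->window_limit)
        return ROUTE_DOWNSTREAM;
    return ROUTE_UPSTREAM;
}

In this model, traffic between two endpoints can only meet at a bridge whose
window covers the target address, and the single host has configured all the
windows -- which is the tree-topology limitation Kai and TC point out.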


Stephen Fuld

unread,
Oct 17, 2004, 5:18:20 PM10/17/04
to

"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:UNednZuDpNL...@rogers.com...

snip

> This is the first time I'm hearing about ASI, when was the group formed?
> Was this group known as something else previously?
>
> So it basically looks like they're using the wires and connectors of PCIe
> to create a switched fabric.

Seems that way.


> Judging by the memberlist, both Intel and AMD are involved, although Intel
> is a board member, while AMD is only a sponsoring member. So I assume
> that Intel started this group as a way to look into turning its own PCIe
> infrastructure into a CPU switching fabric?

Well, this would be the third time that Intel has tried to solve the
problems of large scale peripheral (storage and network) and cluster
interconnect. First was NGIO, then Infiniband. I view ASI as yet another
attempt to get it done, but to avoid some of the incompatibilities with
legacy stuff that has caused IB to be more of a niche product than something
that would be ubiquitous on at least servers, if not desktops. The high end
guys at Intel lost the battle when Intel chose PCI-E over IB, but still have
the same problems to solve as what drove them towards NGIO and IB. As for a
CPU switching fabric, I guess it depends upon what you mean. I don't see
anything about cache coherence, etc. (I am not a member and haven't looked
at the spec), but I suspect that it is not intended as a ccNUMA fabric,
though perhaps it could be adapted for such use.

TC

unread,
Oct 17, 2004, 5:53:15 PM10/17/04
to
Peter,

I think you misunderstood my post, but I do see how you could. I do not
claim to be an expert on Infiniband. So, I exclude myself from those who
have intimate knowledge of both ASI and Infiniband.

The sole intent of my post was to challenge the assertion that ASI is just a
re-invention of Infiniband. From Del's prior responses to the thread I
thought it was clear that he wasn't an expert on ASI (or PCIe for that
matter). I know that Del is an expert in many areas and a valued and
respected contributor to this news group and I certainly did not intend to
attack him personally. I do think it is important that people challenge and
critique others' ideas, however.

Del, if I did offend you I offer my humble apology. It was not my
intent.

As far as your assertion that I am a pompous git... hey, it's free speech AND
free thought, right? I trust others to form their own opinions of me, and of
your remark.

I agree strongly with your stated desire that posters contribute their
knowledge to this forum. I also believe that there are many like me, who
would contribute more frequently if there were fewer posts making personal
attacks.

Respectfully,
TC


"Peter Boyle" <pbo...@holyrood.ed.ac.uk> wrote in message
news:Pine.GSO.4.58.04...@holyrood.ed.ac.uk...
>
>

TC

unread,
Oct 17, 2004, 9:15:26 PM10/17/04
to
Sadly, I think you are right. There probably isn't anyone that is 'expert'
in both technologies. The point you make about the difficulty of becoming an
expert in any one of these complex technologies is quite true. However, I
don't think it is impossible that there are people that are knowledgeable
about both technologies and can discuss many (but not all) aspects of them
intelligently. Maybe even that is hoping for too much.

TC

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cktgj2$a9t$1...@gemini.csx.cam.ac.uk...

TC

unread,
Oct 17, 2004, 9:17:36 PM10/17/04
to
The origin of the ASI SIG was the Arapahoe Working Group. ASI has been in
development for over two years. I can't comment on Intel's or AMD's
aspirations or interest in the standard.

TC

"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:UNednZuDpNL...@rogers.com...

TC

unread,
Oct 17, 2004, 9:26:22 PM10/17/04
to
ASI is certainly not (currently) targeting use as a ccNUMA fabric but I
suppose it could be adapted for that purpose.

One aspect of ASI is that it is very extensible (in a way that does not
affect the switches). There are even provisions for vendor specific protocol
extensions. The key capability is that the routing header is completely
separate from the contents of the packet being routed. The header contains a
Protocol Identifier (PI) field that is set by the source node to indicate to
the target node how to interpret the packet payload. The packet payload may
be pure data or it may contain a protocol specific header and payload data.

TC


"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message

news:wkBcd.712200$Gx4....@bgtnsc04-news.ops.worldnet.att.net...
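
As a minimal sketch of the layering TC describes: the fabric routes on the
header alone, and only the target node consults the Protocol Identifier to
decide how to interpret the payload. The PI values and handler comments below
are invented placeholders, not the actual ASI code points.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical PI values -- placeholders, not real ASI assignments. */
enum proto_id {
    PI_EXAMPLE_PCIE_TUNNEL = 1,   /* encapsulated PCIe transaction   */
    PI_EXAMPLE_MESSAGING   = 2,   /* queue-based messaging protocol  */
    PI_EXAMPLE_RAW_DATA    = 3    /* plain data, no inner header     */
};

/* At the target node: the switches never looked inside the payload;
 * only here is the PI used to pick an interpretation. */
void deliver(enum proto_id pi, const uint8_t *payload, size_t len)
{
    switch (pi) {
    case PI_EXAMPLE_PCIE_TUNNEL: /* hand off to the PCIe tunneling layer  */ break;
    case PI_EXAMPLE_MESSAGING:   /* parse inner header, enqueue a message */ break;
    case PI_EXAMPLE_RAW_DATA:    /* copy straight through to the consumer */ break;
    default:                     /* unknown PI: drop or report            */ break;
    }
    (void)payload; (void)len;    /* sketch only: real handlers would use these */
}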

del cecchi

unread,
Oct 17, 2004, 10:47:36 PM10/17/04
to

"TC" <no...@noplace.com> wrote in message
news:fRBcd.7307$NX5....@newsread3.news.atl.earthlink.net...
Ah but in many ways, from your description it is a reinvention of IB.
Now it might be that IB screwed it up the first time as Nick claims, and
ASI will do it better. I guess we shall see. You didn't offend me at
all. All that high level architecture stuff is well above my area of
competence. I thought it sort of amusing that after coming out with
PCI-e because IB was "too complex" now they are adding in many of the
things that made IB complex in the first place. Intel was one of the
big dogs in IB, along with the independent thinkers at Dell and I don't
recall the relation with PCI being a major concern at the time. I have
theories about why things happened the way they did, but I won't
embarrass myself by expounding them.

Don't worry about offending me with technical information. And let me
know if you feel personally attacked by me, because it is purely
unintentional and I will apologise.

del cecchi

Nick Maclaren

unread,
Oct 18, 2004, 8:57:58 AM10/18/04
to
In article <2tgp8kF...@uni-berlin.de>,

del cecchi <dcecchi...@att.net> wrote:
>
>>
>Ah but in many ways, from your description it is a reinvention of IB.
>Now it might be that IB screwed it up the first time as Nick claims, and
>ASI will do it better. I guess we shall see. You didn't offend me at
>all. All that high level architecture stuff is well above my area of
>competence. I thought it sort of amusing that after coming out with
>PCI-e because IB was "too complex" now they are adding in many of the
>things that made IB complex in the first place. Intel was one of the
>big dogs in IB, along with the independent thinkers at Dell and I don't
>recall the relation with PCI being a major concern at the time. I have
>theories about why things happened the way they did, but I won't
>embarrass myself by expounding them.

Hang on. There are only two aspects where I have said that InfiniBand
screwed up, both have been confirmed, but one was bypassable by not
using a large chunk of its claimed function and the other was and is
merely a potential for failure.

I said that allowing users arbitrary use of RDMA was a predictable
disaster, and the solution was to forbid it except for use by device
drivers, MPI in HPC applications, and similar controlled environments.
Lo! and behold! That is just what has been done :-)

I said that the complexity was a software engineering disaster. Well,
it is, but who gives a damn about that? What I can't say is what the
consequences are - which is precisely the point. All of System/360,
TCP/IP and IA64 have the same flaw, as did half the cancelled
military aircraft projects of the past half century. Anyone remember
the TSR2 (if I have recalled its code correctly)?

It wouldn't be hard to check if ASI has made the same mistake as
InfiniBand as far as the first aspect goes, as it is a specific
assumption that I can demonstrate is wrong.

The second is undecidable except in hindsight. If a specification
is too complex to analyse, it is clearly impossible to know whether
it will or will not have insoluble problems lurking within.


Regards,
Nick Maclaren.

Del Cecchi

unread,
Oct 18, 2004, 9:35:28 AM10/18/04
to

"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:UNednZuDpNL...@rogers.com...
I guess I can't learn about ASI because one has to be a member to see
anything except fluff and IBM isn't.

To me personally it looks like an Intel front organization. None of the
board members seem to have any computer or system architecture expertise. I
bet the IP provision is also "all your patent are belong to us". It's said
to be in 4.13 of the bylaws, but I couldn't find the bylaws. :-(

del cecchi
>


Del Cecchi

unread,
Oct 18, 2004, 9:46:21 AM10/18/04
to

"TC" <no...@noplace.com> wrote in message
news:2ZEcd.10264$SZ5....@newsread2.news.atl.earthlink.net...

> ASI is certainly not (currently) targeting use as a ccNUMA fabric but I
> suppose it could be adapted for that purpose.
>
> One aspect of ASI is that it is very extensible (in a way that does not
> affect the switches). There are even provisions for vendor specific protocol
> extensions. The key capability is that the routing header is completely
> separate from the contents of the packet being routed. The header contains a
> Protocol Identifier (PI) field that is set by the source node to indicate to
> the target node how to interpret the packet payload. The packet payload may
> be pure data or it may contain a protocol specific header and payload data.
>
> TC
>
Like IB doesn't already have this? Do you know anything about IB? Check
out V1 Ch9.

Separating payload and routing, what a concept!

del cecchi


Yousuf Khan

unread,
Oct 18, 2004, 10:27:53 AM10/18/04
to
TC wrote:
> The origin of the ASI SIG was the Arapahoe Working Group. ASI has
> been in development for over two years. I can't comment on Intel's or
> AMD's aspirations or interest in the standard.

Aha! Arapahoe (aka 3GIO) was what eventually became PCIe. So it looks like
they split off the more complex issues and put them into this ASI group,
while getting the simpler peripheral-connect architecture ready through the PCIe
group.

Yousuf Khan


Rupert Pigott

unread,
Oct 18, 2004, 10:34:22 AM10/18/04
to
Nick Maclaren wrote:

[SNIP]

> military aircraft projects of the past half century. Anyone remember
> the TSR2 (if I have recalled its code correctly)?

TSR2 seemed to suffer from a tragic but terminal moving goalpost
syndrome. The requirements were pretty extreme too. The F-111 seems
like the nearest alternative, and that was *very* late and *very*
expensive. There is a substantial amount of debate as to whether
the F-111 was anything like as capable as the TSR2.

TSR2 wasn't a complete waste of time though, the engines were
developed and they powered Concorde. Apparently the TSR2 also
contributed a fair amount of know-how and technology to the MRCA.
That would not surprise me as the MRCA pretty much does what the
TSR2 was meant to do 20+ years earlier.

At the end of the day the MRCA (Panavia Tornado) ended up
doing what TSR2 was aiming at and apparently it borrowed fairly
heavily from it. IIRC I read a claim that MRCA was one of the
few military aircraft development projects to come in on time
and under budget...

That reminds me : I really need to see the surviving airframe at
Duxford.

Cheers,
Rupert
--
Threading sequential code through the eye of a parallel needle
makes little sense. ;)

Peter Dickerson

unread,
Oct 18, 2004, 10:43:12 AM10/18/04
to
"Rupert Pigott" <dark...@try-removing-hotmail-this.com> wrote in message
news:10981100...@teapot.planet.gong...

Duxford! I was there yesterday - although not at the museum.

Peter


Yousuf Khan

unread,
Oct 18, 2004, 11:05:02 AM10/18/04
to
Del Cecchi wrote:
> I guess I can't learn about ASI because one has to be a member to see
> anything except fluff and IBM isn't.

Yeah, but I'm wondering why Cisco is a member? How could it possibly use
this stuff?

> To me personally it looks like an Intel front organization. None of
> the board members seem to have any computer or system architecture
> expertise. I bet the IP provision is also "all your patent are belong
> to us" also. Said to be in 4.13 of the bylaws, but I couldn't find
> the bylaws. :-(

That's what I thought too. AMD is probably a member for defensive reasons,
to make sure it has some advance notice about what Intel plans to do.

Yousuf Khan


del cecchi

unread,
Oct 18, 2004, 9:18:54 PM10/18/04
to

"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:_ZmdnWfAVof...@rogers.com...
They had to do it that way if they were going to head off InfiniBand
becoming established. Worked like a charm. I wonder if the guys from
Intel and Dell knew what was up?

del cecchi


Peter Boyle

unread,
Oct 19, 2004, 12:18:34 AM10/19/04
to

Dear TC,

> I think you misunderstood my post, but I do see how you could. I do not
> claim to be an expert on Infiniband. So, I exclude myself from those who
> have intimate knowledge of both ASI and Infiniband.

Firstly, I apologise for the knee-jerk aggression. Despite this I (hope I)
did recognise the unusually high educational value of your first post -
this made my reading of the second very disappointing.

Secondly, I now see your very well explained position, and I
*very* respectfully apologise for my reaction.
While you are not claiming to be an expert, you *do* know far more than
most of us in this area and I do think you could make a very educational post
on this matter.

If I can persuade all to ignore the flame in my post - this is my point:

You mention that to those who know more there are fundamental differences
that render the comparison pointless. I, (and I believe others with a
calmer disposition), would very much like you to elucidate and educate us
all with the reasons behind your post.

> I agree strongly with your stated desire that posters contribute thier
> knowledge to this forum. I also believe that there are many like me, who
> would contibute more frequently if there were fewer posts making personal
> attacks.

I apologise once more, and hope to reassure knowledgeable lurkers that assholes
like myself can reform...

Thanks,

Peter

> The sole intent of my post was to challenge the assertion the ASI is just a
> re-invention of Infiniband. From Del's prior responses to the threads I
> thought it was clear that he wasn't an expert on ASI (or PCIe for that
> matter). I know that Del is an expert in many areas and a valued and
> respected contributor to this news group and I certainly did not intend to
> attack him personally. I do think it is important that people challenge and
> critique others ideas however.
>
> Del, if I did offend you I offer my humble apology. It was not my
> intent.
>
> As far as your assertion that I am a pompous git... hey, its free speech AND
> free thought, right? I trust others to form there own opinions of me, and of
> your remark.
>

> Respectfully,


> TC
>
>
> "Peter Boyle" <pbo...@holyrood.ed.ac.uk> wrote in message
> news:Pine.GSO.4.58.04...@holyrood.ed.ac.uk...
> >
> >
> > On Sat, 16 Oct 2004, TC wrote:
> >
> >> While it is simple for those that haven't studied ASI to compare it to
> >> Infiniband, it is not. For one thing, compatibility to the legacy
> >> infrastructure was not an afterthought. I would suggest that these kind
> >> of
> >> comparisons be left to those who have intimate knowledge of both ASI and
> >> Infiniband.
> >
> > Pointless post.
> >
> > Either educate us with the facts behind your statements, you pompous git,
> > or shut up and don't bother posting.
> >
> > Peter
> >
> >
> >
>
>
>

Peter Boyle pbo...@physics.gla.ac.uk


Yousuf Khan

unread,
Oct 19, 2004, 3:23:12 AM10/19/04
to
del cecchi wrote:
> They had to do it that way if they were going to head off InfiniBand
> becoming established. Worked like a charm. I wonder if the guys from
> Intel and Dell knew what was up?

I thought they were trying to head off Hypertransport?

Anyways, why would Intel be against IB, when it was one of the original
sponsors? What's the history of that? You mentioned Dell, what's it got to
do with it, one way or another?

Yousuf Khan


Del Cecchi

unread,
Oct 19, 2004, 11:34:18 AM10/19/04
to

"Yousuf Khan" <bbb...@ezrs.com> wrote in message
news:_tydnWLb-Yx...@rogers.com...
Dell was one of the original gang of seven that comprised the "merger" of
Future I/O and NGIO that became the InfiniBand Trade Association. Dell made
considerable contributions to the Electrical working group. I don't know
about other groups. But in the case of a disagreement or controversy they
could be counted on to vote with Intel. Not a surprise really.

Then about the time the IB spec was released, Intel came up with PCI-express
and handed it to the SIG. I always thought it was some sort of Intel
internal thing between the server guys and the desktop guys. As has been
noted in this group, IB is server oriented and has a lot of stuff that
likely wouldn't be necessary in the desktop environment. But the whole
episode is sort of puzzling.

Hypertransport is a different kettle of fish. It is a chip-to-chip connect,
like a front side bus. It is not appropriate for board-to-board or rack-to-rack,
although it might be some day.

del cecchi


Patrick Geoffray

unread,
Oct 19, 2004, 12:03:47 PM10/19/04
to TC
TC wrote:
> One aspect of ASI is that it is very extensible (in a way that does not
> affect the switches). There are even provisions for vendor specific protocol
> extensions. The key capability is that the routing header is completely
> separate from the contents of the packet being routed. The header contains a
> Protocol Identifier (PI) field that is set by the source node to indicate to
> the target node how to interpret the packet payload. The packet payload may
> be pure data or it may contain a protocol specific header and payload data.

You are exactly describing the header in the ANSI/VITA 26-1998
standard, aka Myrinet :-)

Providing a generic encapsulation mechanism is definitely the right
thing to do: it lets you separate the transport from the protocol. This
is, IMHO, where IB took the wrong turn, defining transport and protocol
in the same specs (even if it did not need to). It makes the spec huge
and it requires a lot of work to implement. Either you implement
everything and there are things you don't do well because it's not your
domain of expertise, or you implement just parts of it (transport or
protocol) and you struggle with compatibility issues. Furthermore, you
need somebody else's parts and that means you need critical mass.

What happened in the last 5 years is that companies who did implement
parts of it died or merged because the critical mass was not there, and
the ones who tried to do everything did some parts poorly (or the spec was
just poorly designed in some cases, like trying to recycle VIA).

ASI is taking some wrong turns of its own AFAIK. They want to do QoS;
that means priorities and buffering and buffer management, etc. It seems
quite orthogonal to the simplicity of source-based routing, and history
has shown that QoS never really works and eventually ends up as a big gas
factory (remember ATM?). There is still hope, though, that one will be
able to use the buffers only to reduce HOL blocking and scrap the QoS
stuff.

Patrick

Patrick Geoffray

unread,
Oct 19, 2004, 1:14:45 PM10/19/04
to Del Cecchi
Del Cecchi wrote:
> Dell was one of the original gang of seven that comprised the "merger" of
> future I/O and NGIO that became InfiniBand trade association. Dell made
> considerable contributions to the Electrical working group. I don't know
> about other groups. But in the case of a disagreement or controversy they
> could be counted on to vote with Intel. Not a surprise really.
>
> Then about the time the IB spec was released, Intel came up with PCI-express
> and handed it to the SIG. I always thought it was some sort of Intel
> internal thing between the server guys and the desktop guys. As has been

My own conspiracy theory is that Intel used IB as a decoy: push the
bandwagon a little, just enough to see people jumping on it (and investing
in it), and then turn around and put all of their weight behind PCI-Express.
Intel would prefer PCI-Express because it would take back the control
it lost in the Future-IO / NGIO merge, and what a nice way to have other
vendors waste R&D money.

However, an internal clash between Server and Desktop folks at Intel
sounds much more realistic :-)

Patrick

Stephen Fuld

unread,
Oct 19, 2004, 1:46:41 PM10/19/04
to

"del cecchi" <dcecchi...@att.net> wrote in message
news:2tj8bvF...@uni-berlin.de...

Coming from IBM, I know that you couldn't believe that there could be
fighting among different groups within a company! :-) Everyone involved at
Intel knew exactly what was going on. The server guys lost the battle and
the desktop guys won.

Stephen Fuld

unread,
Oct 19, 2004, 1:50:24 PM10/19/04
to

"Del Cecchi" <cecchi...@us.ibm.com> wrote in message
news:2tkqfsF...@uni-berlin.de...

snip

> Then about the time the IB spec was released, Intel came up with
> PCI-express
> and handed it to the SIG. I always thought it was some sort of Intel
> internal thing between the server guys and the desktop guys.

Right.

> As has been
> noted in this group, IB is server oriented and has a lot of stuff that
> likely wouldn't be necessary in the desktop environment. But the whole
> episode is sort of puzzling.

Now you have some idea how various internal IBM battles look to the outside!
People wonder what happened and are pissed off that they expended resources
on what turned out to be the "wrong horse".

Yousuf Khan

unread,
Oct 19, 2004, 3:00:58 PM10/19/04
to
Del Cecchi wrote:
> Dell was one of the original gang of seven that comprised the
> "merger" of future I/O and NGIO that became InfiniBand trade
> association. Dell made considerable contributions to the Electrical
> working group. I don't know about other groups. But in the case of
> a disagreement or controversy they could be counted on to vote with
> Intel. Not a surprise really.

Yeah, who'd have thought, eh?

> Then about the time the IB spec was released, Intel came up with
> PCI-express and handed it to the SIG. I always thought it was some
> sort of Intel internal thing between the server guys and the desktop
> guys. As has been noted in this group, IB is server oriented and has
> a lot of stuff that likely wouldn't be necessary in the desktop
> environment. But the whole episode is sort of puzzling.
>
> Hypertransport is different kettle of fish. It is a chip to chip
> connect, like a front side bus. It is not appropriate for board to
> board or rack to rack, although it might be some day.

Well, perhaps it's puzzling because it wasn't intended as a slight against
IB, but as a competitive move against Hypertransport? But it ended up
stomping over IB too; in fact, it probably affected IB more than it affected
HT. I guess because of your close association with the server side of
things, you weren't paying attention to what was happening on the desktop
front, and how threatened Intel was feeling by all of the stuff that AMD was
coming up with. Hypertransport is only a chip-to-chip interconnect, true,
but Intel had to come up with something with more features than HT. It seems
to me that Intel went around telling people not to support HT because they
would have something that could do everything that HT could do and then some
more; so they threw the kitchen sink into this spec just to show how awesome
this thing was going to be, just to get people to pay attention to them.
Hypertransport works with everything: AGP, PCI, PCI-X, IB, proprietary, even
this PCIe. So it didn't matter to HT what its own limitations were, you
could just interface it to another type of interconnect and get the right
features. Needed board-to-board, or rack-to-rack? Just plug an IB router or
something else into HT and you got it. Intel wanted to show that PCIe could
do all of that by itself, without needing anything from outside; from
chip-to-chip, all of the way up to rack-to-rack; it probably even
considered throwing in wan-to-wan. :-)

I think Intel's panic over HT has subsided a little now (or it's just
resigned to it now), so it's pushed off some of the more radical aspects of
PCIe, into a different working group, like this ASI group.

Yousuf Khan


Yousuf Khan

unread,
Oct 19, 2004, 3:04:22 PM10/19/04
to
Patrick Geoffray wrote:
> TC wrote:
>> One aspect of ASI is that it is very extensible (in a way that does
>> not affect the switches). There are even provisions for vendor
>> specific protocol extensions. The key capability is that the routing
>> header is completely separate from the contents of the packet being
>> routed. The header contains a Protocol Identifier (PI) field that is
>> set by the source node to indicate to the target node how to
>> interpret the packet payload. The packet payload may be pure data or
>> it may contain a protocol specific header and payload data.
>
> You are exactely describing the header in the ANSI/VITA 26-1998
> standard, aka Myrinet :-)

Seems Intel is trying to be all things to all people with PCIe and its
derivatives.

Yousuf Khan


TC

unread,
Oct 19, 2004, 7:54:32 PM10/19/04
to
Nick, very interesting perspective on RDMA. I'd like to understand the
issues in more detail should you have time to explain it.

TC

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message

news:cl0ekm$g7n$1...@gemini.csx.cam.ac.uk...

TC

unread,
Oct 19, 2004, 8:09:26 PM10/19/04
to
I think there might be more information available than you may have
realized. It seemed odd to me that the ASI SIG would not post by-laws so I
went to the main page..

http://asi-sig.org/home

clicked on the "join" link which brought me to...

http://asi-sig.org/join

and found a link to the by-laws on that page...

http://www.asi-sig.org/join/asisig_bylaws.pdf

I didn't specifically check, but I am fairly certain that Intellectual
Property rights are treated in a manner very similar to other standards
organizations (reasonable and non-discriminatory licensing terms for
contributed IP). Because the ASI SIG's origins were in the Arapahoe Work Group, I
don't think there is any evil plot by Intel involved here.

On the ASI SIG web site you will also find a link to an education page.
Again, I don't really know what is there. One other source for public
information that I would check out is the Intel IDF presentations. I hope
this helps people that are looking for more information.

TC

"Del Cecchi" <cecchi...@us.ibm.com> wrote in message

news:2thv51F...@uni-berlin.de...

TC

unread,
Oct 19, 2004, 8:11:18 PM10/19/04
to
Del, I get a sense from your post that you are in a rush to validate your
assertion that ASI is just the re-invention of Infiniband. Please tell me it
isn't so !!!

I still believe that this is not a simple comparison to make and certainly
not one that can be done quickly.

My post didn't make any claims that the separation of the routing header
from payload content (the ability to encapsulate) was novel or unique. I
simply stated that it was an aspect (or capability) of ASI. I was just
trying to share info.

For those that want more details... here is some additional info on the ASI
route header.

The ASI route header is 8 bytes. It contains everything
switches need to route packets and a field that identifies the format of the
encapsulated packet (the field is called a Protocol Interface, or PI).

The link layer adds 8 bytes of overhead...
- a start byte and stop byte for packet delineation (framing)
- two bytes that contain a packet sequence ID
- four bytes for a 32-bit link layer CRC

This is a total of 16 bytes for the ASI header and link layer overhead.

The encapsulated payload (and any packet (PI) specific header information)
consumes additional bytes.

How does Infiniband compare to this?

TC

"Del Cecchi" <cecchi...@us.ibm.com> wrote in message

news:2thvpeF...@uni-berlin.de...
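
Writing TC's byte counts down as C constants makes the 16-byte total easy to
check. Only the sizes come from the post; the layout on the wire is not given
there, so nothing below says where each piece sits.

#include <stdio.h>

/* Byte counts from the post above; ordering on the wire is not encoded. */
enum {
    ASI_ROUTE_HEADER_BYTES   = 8,  /* routing info + Protocol Interface (PI) */
    ASI_FRAMING_BYTES        = 2,  /* start byte + stop byte                 */
    ASI_SEQ_ID_BYTES         = 2,  /* packet sequence ID                     */
    ASI_LINK_CRC_BYTES       = 4,  /* 32-bit link-layer CRC                  */

    ASI_LINK_OVERHEAD_BYTES  = ASI_FRAMING_BYTES + ASI_SEQ_ID_BYTES
                             + ASI_LINK_CRC_BYTES,               /* = 8  */
    ASI_TOTAL_OVERHEAD_BYTES = ASI_ROUTE_HEADER_BYTES
                             + ASI_LINK_OVERHEAD_BYTES           /* = 16 */
};

int main(void)
{
    printf("per-packet overhead before payload: %d bytes\n",
           ASI_TOTAL_OVERHEAD_BYTES);
    return 0;
}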

TC

unread,
Oct 19, 2004, 8:12:35 PM10/19/04
to
Yes. I think this is the way to think about PCIe and ASI.

TC

"Yousuf Khan" <bbb...@ezrs.com> wrote in message

news:_ZmdnWfAVof...@rogers.com...

TC

unread,
Oct 19, 2004, 8:15:31 PM10/19/04
to
Are we back to evil plots again?

Anyone that has worked in large companies, Intel included, knows all too
well that it is virtually impossible to get the whole company to agree on
anything. I think PCIe was motivated by the need for compatibility with the
legacy (PC) software infrastructure. But, nobody ever lets me into the
secret club so I don't really know about these things.

TC

"del cecchi" <dcecchi...@att.net> wrote in message
news:2tj8bvF...@uni-berlin.de...
>

TC

unread,
Oct 19, 2004, 8:20:09 PM10/19/04
to
Peter,

Thanks for the apology. I really did believe that you misunderstood my post.
It is always nice to see flames simply go away once misunderstandings are
cleared up!

Now where were we on that technical stuff?

del cecchi

unread,
Oct 19, 2004, 10:55:57 PM10/19/04
to

"TC" <no...@noplace.com> wrote in message
news:W0idd.3063$ta5....@newsread3.news.atl.earthlink.net...

> I think there might be more information available than you may have
> realized. It seemed odd to me that the ASI SIG would not post by-laws so I
> went to the main page..
>
> http://asi-sig.org/home
>
> clicked on the "join" link which brought me to...
>
> http://asi-sig.org/join
>
> and found a link to the by-laws on that page...
>
> http://www.asi-sig.org/join/asisig_bylaws.pdf
>
> I didn't specifically checked but I am fairly certain that Intellectual
> Property rights are treated in a manner very similar to other standards
> organizations (reasonable and non-discriminatory licensing terms for
> contributed IP). Because the ASI SIG's orgins were the Arapahoe Work Group I
> don't think there is any evil plot by Intel involved here.
>
> On the ASI SIG web site you will also find a link to an education page.
> Again, I don't really know what is there. One other source for public
> information that I would check out is the Intel IDF presentations. I hope
> this helps people that are looking for more information.
>
> TC
>
Thanks for finding the bylaws. I wonder if they bothered with any
protection in their architecture. The IDF papers are a good suggestion.

del cecchi


del cecchi

unread,
Oct 19, 2004, 11:03:13 PM10/19/04
to

"TC" <no...@noplace.com> wrote in message
news:G2idd.3070$ta5...@newsread3.news.atl.earthlink.net...

> Del, I get a sense from your post that you are in a rush to validate your
> assertion that ASI is just the re-invention of Infiniband. Please tell me it
> isn't so !!!
>
> I still believe that this is not a simple comparison to make and certainly
> not one that can be done quickly.
>
> My post didn't make any claims that the separation of the routing header
> from payload content (the ability to encapsulate) was novel or unique. I
> simply stated that it was an aspect (or capability) of ASI. I was just
> trying to share info.
>
Nope, not in a rush at all. In fact I really don't know enough details
about IB at that level to judge. I took a quick look at volume 1 to try
to find a table of packet formats, and couldn't. But having packets
that are basically a payload with routing and address information, and
CRC, seems common sense.

And rather than "reinvention" which is not a word I recall using (but
they say memory is the second thing to go...) I would say that they are
incorporating many of the same concepts and the same functionality as
IB. Now understand only have a superficial knowledge of the upper
levels of IB and only the information in your posts about ASI so my
opinion is not based on that much. But it stands to reason that if the
two are used for the same purpose they will have much the same function.

del cecchi


Stephen Fuld

unread,
Oct 20, 2004, 1:49:25 AM10/20/04
to

"del cecchi" <dcecchi...@att.net> wrote in message
news:2tm2bkF...@uni-berlin.de...

snip

> Thanks for finding the bylaws. I wonder if they bothered with any
> protection in their architecture. The IDF papers are a good suggestion.

On their web site, I think I also saw that while on-line access to the specs
was for members only, non-members could order paper copies. This seems
strange, but perhaps it is a way to get a copy of the spec for a not too
unreasonable fee.

Patrick Geoffray

unread,
Oct 20, 2004, 3:38:13 AM10/20/04
to
TC wrote:

> For those that want more details... here is some additional info on the ASI
> route header.
>
> The ASI route header is 8 bytes. The ASI route header contains everything
> switches need to route packets and a field that identifies the format of the
> encapsulated packet (the field is called a Protocol Interface, or PI).

ASI claims to do source-based routing, but the route header is constant
size. So I guess the crossbar will use this routing ID to look up the
routing information in a compact table. So the size of this lookup table
in the crossbar will limit the scalability of the fabric. Well, I don't
think having an 8K SRAM in the chip is expensive, nothing compared to the
complexity introduced by QoS and prioritized queues...

> The link layer adds 8 bytes of overhead...
> - a start byte and stop byte for packet delineation (framing)
> - two bytes that contain a packet sequence ID
> - four bytes for a 32-bit link layer CRC

So the CRC is in the header? Interesting. Does each crossbar have to
check the CRC when switching a packet? It's very useful to be able to
do that in order to find bad links (if the CRC is checked only at the
end, it does not tell you at which hop the corruption occurred). And it's
usually easier to do that when the CRC is at the end of the packet.

Does the CRC cover the route and link headers?

Thanks.

Patrick

Patrick Schaaf

unread,
Oct 20, 2004, 3:46:51 AM10/20/04
to
Patrick Geoffray <pat...@myri.com> writes:

>Does each crossbar has to check the CRC when switching a packet ?

[...]


>it's usually easier to do that when the CRC is at the end of the packet.

How so? I can see that it is easier (potentially lower latency) to
generate a trailing CRC at the sending site, but checking shouldn't
care much about where it sits, no?

best regards
Patrick

Nick Maclaren

unread,
Oct 20, 2004, 5:37:39 AM10/20/04
to

In article <YOhdd.1674$5i5....@newsread2.news.atl.earthlink.net>,

"TC" <no...@noplace.com> writes:
|> Nick, very interesting perspective on RDMA. I'd like to understand the
|> issues in more detail should you have time to explain it.

You can try searching on old threads, but here is the gist.

There are many problems with allowing I/O devices direct access
to memory (RDMA), including that of locking the physical memory
while they are in use. InfiniBand makes this particularly
obtrusive by (effectively) keeping its own copy of the virtual
to physical mapping and ownership in the device.

This prevents processes from being swapped cleanly, prevents
clean defragmentation of physical memory, and is a serious RAS
problem. If an I/O device goes AWOL on a large SMP system, what
does the kernel do with the memory it was 'allocated'? If it
reuses it, the device may suddenly come to life and trash it.
If it doesn't, it loses memory and/or needs extra logic to
handle the case of the device being replaced.

Traditionally this has been done solely in two circumstances:
device drivers and specialised applications (typically HPC MPI
or databases). In both cases, the use is well controlled, and
neither swapping nor defragmentation are wanted - i.e. the tuned
use is to pin the critical data for the life of the application
(or driver) anyway.

Allowing it to be used by arbitrary programs, especially with
POSIX's ghastly specification of asynchronous I/O (the relevant
interface) is a completely different kettle of fish. A program
can perfectly well pin a section of a thread's stack or a shared
library, pin a lot of scattered locations and so on. There are
no POSIX specifications for controlling any of that, and I have
not seen any in current Unices.

All current implementations that I have heard of 'solve' the
problem simply by supporting only the traditional uses, which
we know how to implement. There is another solution, which I
favour and have posted, but it still doesn't allow arbitrary use
of RDMA.


Regards,
Nick Maclaren.
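
A minimal sketch of the "pin it for the life of the application" pattern
described above, using POSIX mlock() as a stand-in for whatever registration
call a real RDMA stack provides (no actual InfiniBand verbs are shown, and the
buffer size is arbitrary). The point is only that the pages backing the buffer
must stay resident for as long as the device may touch them.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define RDMA_BUF_SIZE (1 << 20)   /* 1 MiB staging buffer, arbitrary size */

int main(void)
{
    void *buf = malloc(RDMA_BUF_SIZE);
    if (buf == NULL)
        return 1;

    /* Pin the buffer: the kernel will no longer swap these pages out.
     * Real RDMA registration goes further and records the virtual-to-physical
     * mapping in the adapter -- the part identified above as the
     * swapping/defragmentation/RAS problem. */
    if (mlock(buf, RDMA_BUF_SIZE) != 0) {
        perror("mlock");
        free(buf);
        return 1;
    }

    /* ... device DMAs into/out of buf for the life of the application ... */

    munlock(buf, RDMA_BUF_SIZE);  /* only safe once the device is quiesced */
    free(buf);
    return 0;
}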

Patrick Geoffray

unread,
Oct 20, 2004, 5:40:20 AM10/20/04
to Patrick Schaaf

Well, you usually compute the CRC as the packet goes through the
crossbar. So, if it's at the end of the packet, you stop the CRC
computation just before the CRC itself and you can compare with the
value that is just under your fingers.

If the CRC is in the header, you need to read it as the packet goes
through and keep it on the side so you can compare it with the generated
CRC when the whole packet has been switched. Unless you do store and
forward and can access the CRC value anytime, but I thought ASI does
wormhole switching.

It's possible that the CRC is not checked by the crossbars, so you don't
have the wormhole switching constraints and then it does not matter much
if the CRC is at the beginning or the end of the packet.

However, the CRC cannot cover itself, so it's odd that it's in the middle
of the headers. If this CRC doesn't cover the headers, what does?

Patrick
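
A small sketch of the point under discussion: with the CRC in the header, a
cut-through switch has to stash the received value aside and compare only after
the tail has streamed past, whereas a trailing CRC arrives exactly when the
running computation finishes. The CRC-32 below is a plain bitwise implementation
for illustration only; the polynomial and conventions of any real link layer may
differ.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320) -- illustration only. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *p, size_t n)
{
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return crc;
}

/* CRC carried in the header: stash the received value, run the CRC over the
 * rest of the packet as it streams through, compare once the tail has gone by. */
static bool check_header_crc(uint32_t crc_from_header,
                             const uint8_t *rest, size_t rest_len)
{
    uint32_t crc = crc32_update(0xFFFFFFFFu, rest, rest_len) ^ 0xFFFFFFFFu;
    return crc == crc_from_header;
}

/* Trailing CRC: when the last covered byte has passed, the running value is
 * already complete and the value to compare against is the next thing on the
 * wire. */
static bool check_trailing_crc(const uint8_t *pkt, size_t len_without_crc,
                               uint32_t crc_from_tail)
{
    uint32_t crc = crc32_update(0xFFFFFFFFu, pkt, len_without_crc) ^ 0xFFFFFFFFu;
    return crc == crc_from_tail;
}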

Patrick Schaaf

unread,
Oct 20, 2004, 6:18:03 AM10/20/04
to
Patrick Geoffray <pat...@myri.com> writes:

>>>it's usually easier to do that when the CRC is at the end of the packet.
>>
>> How so? I can see that it is easier (potentially lower latency) to
>> generate a trailing CRC at the sending site, but checking shouldn't
>> care much about where it sits, no?

>Well, you usually compute the CRC as the packet goes through the
>crossbar. So, if it's a the end o the packet, you stop the CRC
>computation just before the CRC itself and you can compare with the
>value that is just under your fingers.

>If the CRC is in the header, you need to read it as the packet goes
>through and keep it on the side so you can compare it with the generated
>CRC when the whole packet as been switched. Unless you do store and
>forward and you can access the CRC value anytime but I thought ASI does
>wormhole switching.

Even then, putting aside the CRC while it flows through wouldn't be
that much hassle, I think. Maybe it's even an advantage to be able
to preload one input of the comparator with the packet's CRC value,
speeding up the final comparison when the computed CRC is done?

best regards
Patrick

Nick Maclaren

unread,
Oct 20, 2004, 6:30:28 AM10/20/04
to

In article <41763284...@myri.com>,

Patrick Geoffray <pat...@myri.com> writes:
|>
|> Well, you usually compute the CRC as the packet goes through the
|> crossbar. So, if it's a the end o the packet, you stop the CRC
|> computation just before the CRC itself and you can compare with the
|> value that is just under your fingers.

That was important way back when, but has not been a serious issue
for 20 years and more.

|> If the CRC is in the header, you need to read it as the packet goes
|> through and keep it on the side so you can compare it with the generated
|> CRC when the whole packet as been switched. Unless you do store and
|> forward and you can access the CRC value anytime but I thought ASI does
|> wormhole switching.

With an appropriate error recovery specification, you can do delayed
error reporting and still get things right. It just needs the right
design.

|> However, the CRC cannot cover itself, so its odd that it's in the middle
|> of the headers. If this CRC does protect the headers, what does ?

Eh? A simple checksum, of ANY form, automatically covers itself.
The check is semantically "Are the data and the checksum compatible?
Please reply yes or no."

Also, with invertible hashes like CRCs, it is easy to define and
write them so that the check calculates them over the whole packet
(including themselves) and then does the check. That isn't possible
with cryptographic ones.


Regards,
Nick Maclaren.

Joachim Worringen

unread,
Oct 20, 2004, 7:21:42 AM10/20/04
to
Del Cecchi in <2t5e0lF...@uni-berlin.de>:
not a cache coherent thing, more like serial PCI. So it isn't a
> NUMA
> appropriate interconnect. Aren't all NUMA interconnects proprietary? I
> can't think of one that isn't. (well, SCI isn't, but I don't know anyone
> using it anymore without modifications)(but I could easily be mistaken)

AFAIK, the Dolphin PCI-SCI adapters talk IEEE-SCI (w/o coherence, of course)
[see http://www.dolphinics.com]. But this is an I/O interconnect.

Joachim

--
reply to joachim at domain ccrl-nece dot de

Opinion expressed is personal and does not constitute
an opinion or statement of NEC Laboratories.

Robert Myers

unread,
Oct 20, 2004, 8:27:30 AM10/20/04
to
TC wrote:

> Are we back to evil plots again?
>
> Anyone that has worked in large companies, Intel included, knows all too
> well that it has virtually impossible to get the whole company to agree on
> anything. I think PCIe was motivated by the need for compatibility the
> legacy (PC) software infrastructure. But, nobody ever lets me into the
> secret club so I don't really know about these things.
>

It's an evil plot if the company tries to protect enough turf so that it
can stay in business?

Check out Appendix A.2 of

http://www-library.lbl.gov/docs/PUB/5500/PDF/PUB-5500.pdf

It's easy to reproduce the public version here:

A.2 ViVA Technical Overview
[Proprietary information deleted.]

Rather than being a nominally cooperative industry standards group, LBL
and the DoE are instruments of U.S. industrial policy.

All appearances to the contrary, I don't really know what to think.
It's a hard problem. Every manufacturer wants its solution to be the
industry standard solution, and it wants to have at least a damn good
head start. If manufacturers can't protect turf, investment will go
elsewhere.

To return to the original question of Intel's maneuvering in the
interconnect business, I could find any number of statements from Intel
signalling its intent to capture ever larger pieces of real estate
working from the processor out. When Intel left infiniband to Mellanox
et. al., you just knew they had a plan B, and that's just business.

RM

Del Cecchi

unread,
Oct 20, 2004, 9:31:07 AM10/20/04
to

"Patrick Geoffray" <pat...@myri.com> wrote in message
news:41763284...@myri.com...

And it is nice to be able to stomp the crc when an error is detected. Can't
do that if the CRC is long gone by the time you know.

The overly complicated IB has 2 CRCs. :-)

del cecchi


Stephen Fuld

unread,
Oct 20, 2004, 1:51:07 PM10/20/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cl5bl3$qpi$1...@gemini.csx.cam.ac.uk...

>
> In article <YOhdd.1674$5i5....@newsread2.news.atl.earthlink.net>,
> "TC" <no...@noplace.com> writes:
> |> Nick, very interesting perspective on RDMA. I'd like to understand the
> |> issues in more detail should you have time to explain it.
>
> You can try searching on old threads, but here is the gist.
>
> There are many problems with allowing I/O devices direct access
> to memory (RDMA), including that of locking the physical memory
> while they are in use.

Presumably you mean direct access to *user* memory, as any I/O device has
access to memory. And, since the memory must be locked at least during the
I/O operation, again, I presume what you are objecting to is the capability
of locking memory over longer periods, even when I/O is not in progress.
Are these right?

> InfiniBand makes this particularly
> obtrusive by (effectively) keeping its own copy of the virtual
> to physical mapping and ownership in the device.

Well, the device keeps a copy of the address of a queue pair descriptor,
not directly the address into which it is going to do the transfer. The
queue descriptor has the actual transfer address. That is an important
distinction, as it is a way of fixing one of the problems you discuss below.

> This prevents processes from being swapped cleanly, prevents
> clean defragmentation of physical memory, and is a serious RAS
> problem. If an I/O device goes AWOL on a large SMP system, what
> does the kernel do with the memory it was 'allocated'?

It kills the queue pair descriptor, which will prevent any further transfers
for that operation. (see above)

> If it
> reuses it, the device may suddenly come to life and trash it.
> If it doesn't, it loses memory and/or needs extra logic to
> handle the case of the device being replaced.
>
> Traditionally this has been done solely in two circumstances:
> device drivers and specialised applications (typically HPC MPI
> or databases).

Yes.

> In both cases, the use is well controlled, and
> neither swapping nor defragmentation are wanted - i.e. the tuned
> use is to pin the critical data for the life of the application
> (or driver) anyway.
>
> Allowing it to be used by arbitrary programs, especially with
> POSIX's ghastly specification of asynchronous I/O (the relevant
> interface) is a completely different kettle of fish. A program
> can perfectly well pin a section of a thread's stack or a shared
> library, pin a lot of scattered locations and so on. There are
> no POSIX specifications for controlling any of that, and I have
> not seen any in current Unices.

I can't talk about the Posix spec, but I will certainly believe you. I
don't think most people would support arbitrary use of RDMA for every
application. For traditional use, the OS would register the area of its
buffer cache (which isn't swapped or defragged) and the RDMA transfers would
occur there. Then the traditional mechanisms would move the data to the
user space.

But if you look at say IBM's access methods, the buffers are allocated in
the user's space and the I/O is done directly to them. In that case, the
major difference between that and RDMA is that the buffers would typically
be pinned during the time that no i/o was in process. That is the tradeoff
for reducing the overhead of initiating the i/o. It is not always a good
tradeoff.

But for the kinds of applications that you mention above (HPC-MPI and
databases), the reduction in i/o overhead is a worthwhile tradeoff for
keeping the memory pinned, since, as you point out, it is usually pinned
anyway.
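
Just to make the pinning tradeoff concrete, a minimal POSIX sketch (plain
mlock(), not an InfiniBand verbs memory-registration call): the buffer is
pinned once up front, reused for any number of transfers, and only unpinned
when the application is finished with it.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len  = 4 * page;                 /* a small "I/O buffer" */
    void *buf;

    int rc = posix_memalign(&buf, page, len);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign: error %d\n", rc);
        return 1;
    }

    /* Pin the buffer once.  The kernel may no longer page or relocate these
       frames -- exactly the property an RDMA-capable device needs, and
       exactly what makes swapping and defragmentation awkward if the lock
       is held for the life of the application. */
    if (mlock(buf, len) != 0) {
        perror("mlock");
        free(buf);
        return 1;
    }

    memset(buf, 0, len);    /* ... many transfers would reuse buf here ... */

    munlock(buf, len);      /* unpin only when finished */
    free(buf);
    return 0;
}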

Nick Maclaren

unread,
Oct 20, 2004, 2:06:00 PM10/20/04
to
In article <Myxdd.14816$OD2....@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>> There are many problems with allowing I/O devices direct access
>> to memory (RDMA), including that of locking the physical memory
>> while they are in use.
>
>Presumably you mean direct access to *user* memory, as any I/O device has
>access to memory. And, since the memory must be locked at least during the
>I/O operation, again, I presume what you are objecting to is the capability
>of locking memory over longer periods, even when I/O is not in progress.
>Are these right?

No, sorry.

I mean from as well as to, because the harm isn't just the access,
but the pinning. You don't want the kernel to move a page while it
is being written :-) There were also some 'interesting' security
exposures here under MVT, cured by removing the facility in MVS.

And the problem with allowing it 'just' when I/O is in progress is
that it is not a bounded timestep. It just about is for local disks,
which was traditionally assumed by Unix, but it really, really isn't
for sockets ....

>> This prevents processes from being swapped cleanly, prevents
>> clean defragmentation of physical memory, and is a serious RAS
>> problem. If an I/O device goes AWOL on a large SMP system, what
>> does the kernel do with the memory it was 'allocated'?
>
>It kills the queue pair descriptor, which will prevent any further transfers
>for that operation. (see above)

But it doesn't stop a supposedly abandoned transfer from being
completed, and accessing a queue pair descriptor isn't an 'active'
operation, so you can't tell if it has been done. At least as I
understand the specification.

>But if you look at say IBM's access methods, the buffers are allocated in
>the user's space and the I/O is done directly to them. In that case, the
>major difference between that and RDMA is that the buffers would typically
>be pinned during the time that no i/o was in process. That is the tradeoff
>for reducing the overhead of initiating the i/o. It is not always a good
>tradeoff.

They were in MVT, which did not support virtual memory. In MVS, that
required privilege, precisely to stop the problems I mentioned. Yes,
I know that it wasn't that simple, and SOME direct access was used,
but it was under the control of the supervisor side of EXCP and no
longer under the control of the access method macros.

>But for the kinds of applications that you mention above (HPC-MPI and
>databases), the reduction in i/o overhead is a worthwhile for trading off
>that the memory is pinned, since, as you point out, it is usually pinned
>anyway.

Yes, precisely. In HPC, we go to great efforts to map one process/thread
to a CPU, map memory close to the CPU that 'owns' it, and block even
a chance of paging or swapping.


Regards,
Nick Maclaren.

Patrick Geoffray

unread,
Oct 20, 2004, 7:09:49 PM10/20/04
to Nick Maclaren
Nick Maclaren wrote:
> Eh? A simple checksum, of ANY form, automatically covers itself.
> The check is semantically "Are the data and the checksum compatible?
> Please reply yes or no."

Yes, but how do you compute it? I mean, if you have the value 0 in the
CRC field and you feed the whole packet to a CRC engine, writing the
value of the checksum in the CRC field modifies the content of the packet
and invalidates the CRC you just computed. I still don't know if this CRC
covers the whole packet, including the route and link headers, or just
the payload (in which case it's a non-issue).

> Also, with invertible hashes like CRCs, it is easy to define and
> write them so that the check calculates them over the whole packet
> (including themselves) and then does the check. That isn't possible
> with cryptographic ones.

I have to trust you on this because my knowledge of CRC properties is
close to none.

Patrick

del cecchi

unread,
Oct 20, 2004, 10:16:01 PM10/20/04
to

"Patrick Geoffray" <pat...@myri.com> wrote in message
news:4176F03D...@myri.com...

One question would be "is every bit in the packet invariant as the
packet traverses the network?" Some protocols modify stuff in the
header like routing stuff. That invalidates the CRC, requiring a new
one to be inserted. Thus the packet would be unprotected from the time
it is checked until the new CRC is generated if there is but one CRC and
packets are not invariant.

del cecchi


del cecchi

unread,
Oct 20, 2004, 10:21:22 PM10/20/04
to

"Robert Myers" <rmyer...@comcast.net> wrote in message
news:SQsdd.160423$He1.39482@attbi_s01...

> TC wrote:
>
> > Are we back to evil plots again?
Hey, you are the one that brought up evil. :-) But now that you
mention it......

>
> It's an evil plot if the company tries to protect enough turf so that
it
> can stay in business?

Pooooor Intel, barely hanging on by their fingernails.


>
> Check out Appendix A.2 of
>
> http://www-library.lbl.gov/docs/PUB/5500/PDF/PUB-5500.pdf
>
> It's easy to reproduce the public version here:
>
> A.2 ViVA Technical Overview
> [Proprietary information deleted.]
>
> Rather than being a nominally cooperative industry standards group,
LBL
> and the DoE are instruments of U.S. industrial policy.
>
> All appearances to the contrary, I don't really know what to think.
> It's a hard problem. Every manufacturer wants its solution to be the
> industry standard solution, and it wants to have at least a damn good
> head start. If manufacturers can't protect turf, investment will go
> elsewhere.
>
> To return to the original question of Intel's maneuvering in the
> interconnect business, I could find any number of statements from
Intel
> signalling its intent to capture ever larger pieces of real estate
> working from the processor out. When Intel left infiniband to
Mellanox
> et. al., you just knew they had a plan B, and that's just business.
>
> RM

The question is not whether they had a plan B, the question is was it
plan A all along? I worked for IBM and watched Microsoft. I know a
plot when I see one. :-) Although sometimes I see plots that aren't
there.

del
>


Aaron Spink

unread,
Oct 21, 2004, 1:39:52 AM10/21/04
to

"Patrick Geoffray" <pat...@myri.com> wrote in message
news:4176F03D...@myri.com...
> Nick Maclaren wrote:
> Yes, but how do you compute it? I mean, if you have the value 0 in the
> CRC field and you feed the whole packet to a CRC engine, writing the
> value of the checksum in the CRC field modifies the content of the packet
> and invalidates the CRC you just computed. I still don't know if this CRC
> covers the whole packet, including the route and link headers, or just
> the payload (in which case it's a non-issue).
>
Well think of it this way...

You have two things, the packet to be transfered (includes the header and
any related data payload) and the CRC checksum for the packet.

We'll further define there to be a Sent_CRC, a Sent_Data, a Recv_CRC, and a
Recv_Data.

The Sent_CRC is generated by using the CRC function on the Sent_Data. Both
the Sent_CRC and the Sent_Data are sent over the unreliable medium and
become the Recv_CRC and the Recv_Data.

At the receiver we generate a new CRC value, the CHK_CRC, using the CRC
function on the Recv_Data. We then do a comparison between the Recv_CRC and
the CHK_CRC. If they match, then the packet transmission was error free.
If they don't match then there was a transmission error in either the
Sent_CRC to Recv_CRC or Sent_Data to Recv_Data paths. It doesn't matter if
the error was in the CRC or in the Data path as both will generally result
in some error recovery mechanism kicking in.

For protocols that handle long packets and/or widely variable packet sizes,
it is sometimes a good idea to use more than one CRC. In these systems, one
CRC is used to check the header which is usually a known small quantity
(from 8-24 bytes) and another is used to check the payload which can vary
widely (64-16384 bytes). These CRCs are maintained, generated and checked
completely independently throughout transmission. Generally the CRC
protecting the header is of a smaller size (16 or 24 bits) because of the
limited amount of data it is covering whereas the payload CRC is generally
32 bits.
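
A small C sketch of the Sent_CRC / Recv_CRC / CHK_CRC bookkeeping above,
with a plain CRC-16-CCITT standing in for whatever polynomial a real link
would use; the "channel" simply flips one bit so the mismatch path gets
exercised.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* CRC-16-CCITT (polynomial 0x1021, initial value 0xFFFF), bit at a time. */
static uint16_t crc16(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)data[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

int main(void)
{
    uint8_t  sent_data[16] = "header+payload.";
    uint16_t sent_crc = crc16(sent_data, sizeof sent_data);

    /* "Unreliable medium": copy the data, then corrupt a single bit. */
    uint8_t recv_data[16];
    memcpy(recv_data, sent_data, sizeof recv_data);
    recv_data[5] ^= 0x04;
    uint16_t recv_crc = sent_crc;       /* the CRC happened to arrive intact */

    uint16_t chk_crc = crc16(recv_data, sizeof recv_data);
    if (chk_crc == recv_crc)
        printf("packet accepted\n");
    else
        printf("CRC mismatch: kick off error recovery / retransmission\n");
    return 0;
}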

Aaron Spink
speaking for myself inc.


Iain McClatchie

unread,
Oct 21, 2004, 3:20:12 AM10/21/04
to
Checksums:

IP has the checksum near the beginning of the packet. This is a bummer
because the transmit side has to scan the entire contents of the packet
before it sends that checksum. You can buffer the packet in a local
SRAM (after all, it can only be 8KB, right?), but you are still messing
around when you could be pushing bits. Checksum at the beginning adds
latency.

This is enough of a problem that at least some TCP/IP implementations
(and now I forget which ones) have arranged that there are two (or
four? I think it's two) unused payload bytes at the end of the packet.
That is, the encapsulating thing around the data lies and says the data
is two bytes bigger than it is. When the packet is sent, the transmitter
just stuffs zeros or whatever into the checksum field. When it gets to
the end, it fills those last two bytes with whatever it takes to make
the checksum come out.

If the checksum had been a cryptographic hash, this algorithm wouldn't be
possible. But it turns out that IP has a pretty trivial, and more
importantly, invertible hash function.

The invertible hash function is good for other things too. Sometimes
a router will need to rewrite some of the fields in the header. I know
they rewrite the level-2 destination address, I think they append extra
tags for ATM-like routing (some sort of Cisco invention), and I suspect
there are plenty of other changes too, not even getting into stuff like
NAT.

Anyway, because the hash is invertible, the router can rewrite bits of
the header on the way through, determine the change to the checksum, and
apply that change without scanning the rest of the message. Voila, if
the checksum was valid before, it's still valid.
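
To make the "adjust the checksum without rescanning the packet" point
concrete, here is a small C sketch of the standard incremental update for
the Internet ones'-complement checksum (the RFC 1624 form, shown for a
single rewritten 16-bit field); the five-word "header" is just made-up
test data.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Fold a 32-bit sum back into 16 bits with end-around carry. */
static uint16_t fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)sum;
}

/* Full ones'-complement checksum over 16-bit words (for comparison). */
static uint16_t cksum(const uint16_t *w, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++)
        sum += w[i];
    return (uint16_t)~fold(sum);
}

/* Incremental update after one field changes from m_old to m_new:
   HC' = ~( ~HC + ~m_old + m_new ). */
static uint16_t cksum_adjust(uint16_t hc, uint16_t m_old, uint16_t m_new)
{
    uint32_t sum = (uint16_t)~hc;
    sum += (uint16_t)~m_old;
    sum += m_new;
    return (uint16_t)~fold(sum);
}

int main(void)
{
    uint16_t hdr[5] = { 0x4500, 0x0054, 0x1c46, 0x4000, 0x4001 };
    uint16_t hc = cksum(hdr, 5);

    hdr[2] = 0x1c47;                      /* "router" rewrites one field */
    uint16_t hc_incremental = cksum_adjust(hc, 0x1c46, 0x1c47);
    uint16_t hc_full        = cksum(hdr, 5);

    printf("incremental %04x  full %04x  %s\n",
           hc_incremental, hc_full,
           hc_incremental == hc_full ? "match" : "MISMATCH");
    return 0;
}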

Oh: the invertibility bit makes the hash not as good at catching bit
errors. Certain not-very-unlikely error patterns will pass silently.

Now what I don't know, is: can a router figure out from just the header
bits whether the extra two bytes have been allocated at the end of the
packet, in which case it can do fast path routing? I suppose it can,
but I suspect it has to know how to parse the tags for the packet payload
in order to see how long the payload thinks it is, versus how many bytes
are being sent. So this extra-big wrapper idea isn't completely general.

IIRC. Whew. I really don't know very much about this stuff.

That said, if I were designing a packet system, I'd have one checksum at
the end of the packet, and have it cover all the bits in the packet and
header. And I'd probably make it cryptographic (but short). Nowadays,
having hardware recompute the checksum on the way through after the packet
header has been rewritten seems like hardly any overhead. Of course, the
payload is going to need its own secure crypto hash that doesn't get
rewritten, and that hash is going to be 160 bits or more.

Kim Enkovaara

unread,
Oct 21, 2004, 3:56:22 AM10/21/04
to
Iain McClatchie wrote:

> Checksums:
>
> IP has the checksum near the beginning of the packet. This is a bummer
> because the transmit side has to scan the entire contents of the packet
> before it sends that checksum. You can buffer the packet in a local
> SRAM (after all, it can only be 8KB, right?), but you are still messing
> around when you could be pushing bits. Checksum at the beginning adds
> latency.

In IP the checksum is only for the header and maximum header length is 64
bytes. A 64-byte buffer is quite small, and in cell-based architectures that
could be the cell size, so no overhead is needed. Just forward the result
to the next cell pipeline stage that modifies the cell.

IP-packet can be 64k in size, but of course the vendor can set some
other equipment level mtu value to save internal memory.

> Anyway, because the hash is invertible, the router can rewrite bits of
> the header on the way through, determine the change to the checksum, and
> apply that change without scanning the rest of the message. Voila, if
> the checksum was valid before, it's still valid.

It's usually easier just to verify the checksum at ingress and recalculate
it at egress. Of course internal backplane crcs etc. are needed to make
sure that node is not corrupting the traffic itself. If tens of different
blocks are modifying the headers, each would need the checksum modification
logic.

--Kim

Nick Maclaren

unread,
Oct 21, 2004, 5:34:18 AM10/21/04
to

In article <4176F03D...@myri.com>, Patrick Geoffray <pat...@myri.com> writes:
|> Nick Maclaren wrote:
|> > Eh? A simple checksum, of ANY form, automatically covers itself.
|> > The check is semantically "Are the data and the checksum compatible?
|> > Please reply yes or no."
|>
|> Yes, but how do you compute it? I mean, if you have the value 0 in the
|> CRC field and you feed the whole packet to a CRC engine, writing the
|> value of the checksum in the CRC field modifies the content of the packet
|> and invalidates the CRC you just computed. I still don't know if this CRC
|> covers the whole packet, including the route and link headers, or just
|> the payload (in which case it's a non-issue).

The solution that is used in the tar format is to zero the checksum
field and then calculate it. When checking, you clearly need to
take a copy first!

To do the other form (i.e. the checksum value is used to calculate
the checksum) you use the algebraic properties of the transform.
For example, if you are using a simple sum of words that must be
a constant Z (a ghastly checksum, very popular in older Unix
facilities):

X <= sum of all words with checksum zero
checksum <= Z-X

In THIS case making the checksum match itself is impossible, but
it is generally possible for invertible hashes by solving a simple
equation.
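
A small C sketch of both variants, using a plain additive word sum as the
(deliberately simple) hash: the tar-style "treat the checksum field as
zero" scheme, and the "choose the stored value so the whole record sums to
a constant Z" scheme, where the stored word does take part in its own
check. The field position and the constant are arbitrary.

#include <stdint.h>
#include <stdio.h>

#define NWORDS    8
#define CKSUM_IDX 3              /* which word holds the checksum */
#define Z         0x12345678u    /* target constant for variant 2 */

/* Sum of all words, treating the checksum slot as zero. */
static uint32_t sum_excluding(const uint32_t *rec)
{
    uint32_t s = 0;
    for (int i = 0; i < NWORDS; i++)
        s += (i == CKSUM_IDX) ? 0 : rec[i];
    return s;
}

/* Sum of all words, checksum slot included. */
static uint32_t sum_all(const uint32_t *rec)
{
    uint32_t s = 0;
    for (int i = 0; i < NWORDS; i++)
        s += rec[i];
    return s;
}

int main(void)
{
    uint32_t rec[NWORDS] = { 1, 2, 3, 0, 5, 6, 7, 8 };

    /* Variant 1 (tar-style): store the sum computed with the field zeroed;
       the checker recomputes with the field treated as zero and compares. */
    rec[CKSUM_IDX] = sum_excluding(rec);
    printf("tar-style check: %s\n",
           sum_excluding(rec) == rec[CKSUM_IDX] ? "ok" : "bad");

    /* Variant 2: X is the sum with the checksum zero; storing Z - X makes
       the sum over the WHOLE record, checksum included, come out to Z. */
    uint32_t x = sum_excluding(rec);
    rec[CKSUM_IDX] = Z - x;                    /* wraps mod 2^32 */
    printf("sum-to-Z check:  %s\n", sum_all(rec) == Z ? "ok" : "bad");
    return 0;
}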


Regards,
Nick Maclaren.

Nick Maclaren

unread,
Oct 21, 2004, 5:42:14 AM10/21/04
to

In article <45022fc8.04102...@posting.google.com>,

iai...@truecircuits.com (Iain McClatchie) writes:
|>
|> Anyway, because the hash is invertible, the router can rewrite bits of
|> the header on the way through, determine the change to the checksum, and
|> apply that change without scanning the rest of the message. Voila, if
|> the checksum was valid before, it's still valid.

Which is the same principle as making the checksum include itself.

|> Oh: the invertibility bit makes the hash not as good at catching bit
|> errors. Certain not-very-unlikely error patterns will pass silently.

No, this is NOT true. It is because it is a BAD invertible checksum!

The one that I have used for ages does not suffer from that problem,
and nor does any competently designed one. For heavens sake, this
was a solved problem before 1930! Yes, I know that 99% of the ones
used in modern computer software are bad, but that isn't the same
as saying that they HAVE to be.

Any N-bit checksum will have the problem that there is a 2^-N chance
of an error being undetected, and with 'randomising' checksums you
can use geometric analyses to estimate how many of them are likely
to be not-very-unlikely. But there is NO difference in that between
a GOOD invertible checksum and a non-invertible one, except that the
latter will resist malice whereas the former won't.


Regards,
Nick Maclaren.

Robert Myers

unread,
Oct 21, 2004, 8:37:10 AM10/21/04
to
del cecchi wrote:

> "Robert Myers" <rmyer...@comcast.net> wrote in message
> news:SQsdd.160423$He1.39482@attbi_s01...
>

<snip>

>>
>>To return to the original question of Intel's
>>maneuvering in the interconnect business, I could find any number
>>>of statements from Intel
>>signalling its intent to capture ever larger pieces of real estate
>>working from the processor out. When Intel left infiniband to
>>Mellanox et. al., you just knew they had a plan B,
>>and that's just business.
>
>

> The question is not whether they had a plan B, the question is was it
> plan A all along? I worked for IBM and watched Microsoft. I know a
> plot when I see one. :-) Although sometimes I see plots that aren't
> there.
>

In my book, a plot is something that should attract the attention of the
Department of Justice or the Federal Trade Commission.

What looked at one point to have been a coming tidal wave of change in
interconnect has turned into a disappointing series of breakers you
couldn't even get a surf board onto. I don't find it beyond credibility
that Intel's maneuvering is at least partly responsible and that Intel
has been maneuvering purely for competitive reasons. If Intel can push
the market around and if, say, IBM can't keep them from doing it in a
way that is detrimental to its interests, that's worth knowing, but I
don't know that it deserves a moral label.

It seems more important to try to understand what might come next than
to put labels on past events, anyway. As I understand it, Intel has
significantly increased its presence in the network processor market.
Marginalizing hypertransport has to be an important goal for Intel.
Intel's /sang-froid/ isn't 100% convincing, but it doesn't seem to be
_too_ worried about the battle over microprocessor performance.
Infiniband is being reincarnated as Advanced Switching with an unusually
friendly relationship with PCI-express. Maybe I'm the one who's seeing
things.

RM

Iain McClatchie

unread,
Oct 21, 2004, 2:20:45 PM10/21/04
to
Kim> In IP the checksum is only for the header and maximum header length is 64
Kim> bytes. 64 byte buffer is quite small, and in cell based architectures that
Kim> could be the cell size, and no overhead is needed. Just forward the result
Kim> to next cellpipeline stage that modifies the cell.

Oops, I screwed up. You're quite right, the IP checksum covers just the
header.

The TCP checksum, however, is at the beginning of the message and covers the
data. This is what I was thinking of. And I think the hack must be to have
the IP header effectively specify two more bytes than the TCP header
specifies for data, to make room for a checksum at the end.

Kim> IP-packet can be 64k in size, but of course the vendor can set some
Kim> other equipment level mtu value to save internal memory.

Didn't ethernet set an MTU of around 1500 bytes, and GigE an MTU of around
8KB? If your end-to-end route goes through any ethernet anywhere, this MTU
is what you'll be working with.

Nick Maclaren

unread,
Oct 21, 2004, 2:42:05 PM10/21/04
to
In article <45022fc8.04102...@posting.google.com>,

Iain McClatchie <iai...@truecircuits.com> wrote:
>
>Kim> IP-packet can be 64k in size, but of course the vendor can set some
>Kim> other equipment level mtu value to save internal memory.
>
>Didn't ethernet set an MTU of around 1500 bytes, and GigE an MTU of around
>8KB? If your end-to-end route goes through any ethernet anywhere, this MTU
>is what you'll be working with.

Er, no. Ethernet is 1500 bytes. "Jumbo frames" are c. 9K, are slightly
non-standard, and have only recently become usable. Many switches and
systems still don't support them, or don't work with them.

Also, the Ethernet MTU is not visible at the TCP/UDP layer, though its
existence creates a healthy amount of confusion and inefficiency.

There were also some, er, 'interesting' historical issues with 64 KB
versus 64 KB minus a little versus 32 KB - 1 versus larger sizes
for some uses. Merciful oblivion has covered most of what I once
knew, and I am disinclined to stir that noisome cesspit again
unless I have to.


Regards,
Nick Maclaren.

Alex Colvin

unread,
Oct 21, 2004, 4:18:04 PM10/21/04
to
>And it is nice to be able to stomp the crc when an error is detected. Can't
>do that if the CRC is long gone by the time you know.

can't you just stomp the end of the packet, which alters the crc?

--
mac the naïf

Alex Colvin

unread,
Oct 21, 2004, 4:23:29 PM10/21/04
to
>In my book, a plot is something that should attract the attention of the
>Department of Justice or the Federal Trade Commission.

Ha! these days we have to rely on the state Attys General.


--
mac the naïf

TC

unread,
Oct 21, 2004, 8:35:32 PM10/21/04
to
> Thanks for finding the bylaws. I wonder if they bothered with any
> protection in their architecture. The IDF papers are a good suggestion.

If you are asking about methods to prevent access to nodes, or specific
resources in nodes, then the answer is yes. There are both general methods
and methods that are protocol specific.

But I'm not sure I am headed in the right direction here... is this the kind
of protection you were asking about?

TC

"del cecchi" <dcecchi...@att.net> wrote in message
news:2tm2bkF...@uni-berlin.de...
>

> "TC" <no...@noplace.com> wrote in message
> news:W0idd.3063$ta5....@newsread3.news.atl.earthlink.net...
>> I think there might be more information available than you may have
>> realized. It seemed odd to me that the ASI SIG would not post by-laws
> so I
>> went to the main page..
>>
>> http://asi-sig.org/home
>>
>> clicked on the "join" link which brought me to...
>>
>> http://asi-sig.org/join
>>
>> and found a link to the by-laws on that page...
>>
>> http://www.asi-sig.org/join/asisig_bylaws.pdf
>>
>> I didn't specifically check but I am fairly certain that
> Intellectual
>> Property rights are treated in a manner very similar to other
> standards
>> organizations (reasonable and non-discriminatory licensing terms for
>> contributed IP). Because the ASI SIG's origins were the Arapahoe Work
> Group I
>> don't think there is any evil plot by Intel involved here.
>>
>> On the ASI SIG web site you will also find a link to an education
> page.
>> Again, I don't really know what is there. One other source for public
>> information that I would check out is the Intel IDF presentations. I
> hope
>> this helps people that are looking for more information.
>>
>> TC


>>
> Thanks for finding the bylaws. I wonder if they bothered with any
> protection in their architecture. The IDF papers are a good suggestion.
>

> del cecchi
>
>


TC

unread,
Oct 21, 2004, 8:45:26 PM10/21/04
to
Ummm... I think my description may not have been clear enough.

The 'packet' has a header and payload.

The link layer prepends the following (in front of the packet)...
- a start symbol for packet delineation (framing)
- a two byte field containing a sequence ID (for link error recovery)

Then comes the ASI header (8 bytes)

Then comes the payload (multiples of 4 bytes)

Then the link layer appends (at the end of the packet)...
- a 32-bit CRC
- a stop symbol for packet delineation (framing)

The link layer CRC is recomputed each time a packet is forwarded to a new
link (it is a hop-by-hop CRC).

There are provisions for an additional, but optional, 32-bit end-to-end CRC
as well that is invariant from the source to the target.

So, the link-layer CRC is not in the header. However, just to make it
confusing, there is a small, separate CRC contained within the header to
cover specific bits in the header.

Also, packet cut-through is supported. This means that there is a method for
killing a corrupt packet that is forwarded, or a packet that is forwarded
incorrectly.

TC

"Patrick Geoffray" <pat...@myri.com> wrote in message

news:cl54l2$1...@flex.myri-local.com...

TC

unread,
Oct 21, 2004, 8:53:58 PM10/21/04
to
And I forgot to add...

ASI does use path routing for Unicast packets. With path routing the
switches do not have any look up tables for unicast routes. The packet tells
the switch where it is going. One down side of this is that the route also
has to have an index (I think ASI calls it a turn pointer). This index is
updated by each switch as the packet propagates through the fabric. Thus
some bits of the header are variable while others are invariant. One benefit
of path routing is that the switches do not have to be configured to be able
to route.

Multicast packets do use a multicast group ID which indexes a table in the
switch indicating which ports the packet is to be propagated to. There is also an input
port field that indicates which ports a given multicast ID is permitted to
be received on. These tables must be configured before multicast packets can
be routed.
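
A toy C sketch of path routing with a turn pointer along the lines just
described; the encoding (one output-port byte per hop plus a one-byte
index) is purely illustrative and is not the actual ASI header format.

#include <stdint.h>
#include <stdio.h>

#define MAX_HOPS 8

/* Toy source-routed header: the sender writes the whole path (one output
   port per hop); each switch consumes route[turn] and advances turn.  Only
   'turn' changes in flight -- the route itself is invariant. */
struct toy_hdr {
    uint8_t turn;                 /* next entry in route[] to use     */
    uint8_t hops;                 /* number of valid route[] entries  */
    uint8_t route[MAX_HOPS];      /* output port to take at each hop  */
};

/* What a switch does: pick the output port and bump the turn pointer.
   Returns the port, or -1 when the route is exhausted (destination). */
static int forward(struct toy_hdr *h)
{
    if (h->turn >= h->hops)
        return -1;
    return h->route[h->turn++];
}

int main(void)
{
    struct toy_hdr h = { 0, 3, { 2, 5, 1 } };   /* port 2, then 5, then 1 */
    int port;

    while ((port = forward(&h)) >= 0)
        printf("switch forwards packet out port %d (turn now %u)\n",
               port, h.turn);
    printf("route exhausted: packet consumed at the destination\n");
    return 0;
}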

TC

unread,
Oct 21, 2004, 8:58:35 PM10/21/04
to
Thanks for the summary. I'll search old threads as you suggested. I'll have
to see what I can learn about the issue.

TC


"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cl5bl3$qpi$1...@gemini.csx.cam.ac.uk...
>

del cecchi

unread,
Oct 21, 2004, 9:19:35 PM10/21/04
to

"Alex Colvin" <al...@TheWorld.com> wrote in message
news:cl95hs$e2a$1...@pcls4.std.com...

Normally (the first time I saw this was in SCI) the "stomp" is created by
replacing the previous crc with a different crc, such as the actual value
for the crc xored with some value. That makes a very distinctive
result. When the distinctive characteristic is detected, one knows it
was detected as bad at the previous node. One doesn't need to just add
errors: the packet is already bad. Changing the last byte just means
it is still bad.
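
A tiny C sketch of the stomp idea (the constant here is made up; SCI and
InfiniBand each define their own): a node that detects a bad packet while
cutting it through replaces the outgoing CRC with the correct CRC xored
with a fixed stomp value, so downstream nodes can tell "already known bad
upstream" apart from a fresh error on this hop.

#include <stdint.h>
#include <stdio.h>

#define STOMP_MAGIC 0x5A5A5A5Au   /* made-up stomp constant, illustration only */

/* Transmit side of a cut-through switch that has just noticed the packet
   is bad: emit a stomped CRC rather than a merely wrong one. */
static uint32_t stomp(uint32_t crc_good)
{
    return crc_good ^ STOMP_MAGIC;
}

/* Receive side: classify the packet given the CRC we computed ourselves
   (crc_good) and the CRC that arrived on the wire (crc_rcvd). */
static const char *classify(uint32_t crc_good, uint32_t crc_rcvd)
{
    if (crc_rcvd == crc_good)
        return "good";
    if (crc_rcvd == (crc_good ^ STOMP_MAGIC))
        return "stomped upstream (don't count a new error on this hop)";
    return "new link error on this hop";
}

int main(void)
{
    uint32_t crc_good = 0xDEADBEEFu;          /* stand-in computed CRC */

    printf("%s\n", classify(crc_good, crc_good));
    printf("%s\n", classify(crc_good, stomp(crc_good)));
    printf("%s\n", classify(crc_good, crc_good ^ 0x00000001u));
    return 0;
}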

del cecchi


del cecchi

unread,
Oct 21, 2004, 9:20:23 PM10/21/04
to

"TC" <no...@noplace.com> wrote in message
news:oBYdd.3956$5i5....@newsread2.news.atl.earthlink.net...

> > Thanks for finding the bylaws. I wonder if they bothered with any
> > protection in their architecture. The IDF papers are a good
suggestion.
>
> If you are asking about methods to prevent access to nodes, or
specific
> resources in nodes, then the answer is yes. There are both general
methods
> and methods that are protocol specific.
>
> But I'm not sure I am headed in the right direction here... is this
the kind
> of protection you were asking about?
>
> TC
>
Yes, thanks. That is what I was referring to.

del


Stephen Fuld

unread,
Oct 22, 2004, 2:18:58 AM10/22/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cl69e8$19p$1...@gemini.csx.cam.ac.uk...

> In article <Myxdd.14816$OD2....@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>>
>>> There are many problems with allowing I/O devices direct access
>>> to memory (RDMA), including that of locking the physical memory
>>> while they are in use.
>>
>>Presumably you mean direct access to *user* memory, as any I/O device has
>>access to memory. And, since the memory must be locked at least during
>>the
>>I/O operation, again, I presume what you are objecting to is the
>>capability
>>of locking memory over longer periods, even when I/O is not in progress.
>>Are these right?
>
> No, sorry.
>
> I mean from as well as to, because the harm isn't just the access,
> but the pinning. You don't want the kernel to move a page while it
> is being written :-)

I didn't mention reads versus writes. Yes the same, or similar
considerations apply to both. What are you trying to say that I am missing?


> There were also some 'interesting' security
> exposures here under MVT, cured by removing the facility in MVS.
>
> And the problem with allowing it 'just' when I/O is in progress is
> that is not a bounded timestep. It just about is for local disks,
> which was traditionally assumed by Unix, but it really, really isn't
> for sockets ....

So you are saying that the only solution is to have a single large area
permanently pinned (i.e. a unix style system cache) and move i/o data
through it?

>>> This prevents processes from being swapped cleanly, prevents
>>> clean defragmentation of physical memory, and is a serious RAS
>>> problem. If an I/O device goes AWOL on a large SMP system, what
>>> does the kernel do with the memory it was 'allocated'?
>>
>>It kills the queue pair descriptor, which will prevent any further
>>transfers
>>for that operation. (see above)
>
> But it doesn't stop a supposedly abandoned transfer from being
> completed,

I'm not sure what you mean here. Each packet has the queue pair number and
thus requires access to the queue pair to complete. So if you kill the
queue pair, you stop any further packets. And, of course, at least for I/O
devices, you can send some sort of abort command (details depending on the
device, of course).

> and accessing a queue pair descriptor isn't an 'active'
> operation, so you can't tell if it has been done. At least as I
> understand the specification.

Right. But the OS can kill one to prevent any further accesses. That
limits what happens to any packets actually in the host interface at that
time.

>>But if you look at say IBM's access methods, the buffers are allocated in
>>the user's space and the I/O is done directly to them. In that case, the
>>major difference between that and RDMA is that the buffers would typically
>>be pinned during the time that no i/o was in process. That is the
>>tradeoff
>>for reducing the overhead of initiating the i/o. It is not always a good
>>tradeoff.
>
> They were in MVT, which did not support virtual memory. In MVS, that
> required privilege, precisely to stop the problems I mentioned. Yes,
> I know that it wasn't that simple, and SOME direct access was used,
> but it was under the control of the supervisor side of EXCP and no
> longer under the control of the access method macros.

I never programmed under MVS at that level, but if I had an assembler
program that ran under OS, I could run that same object module under MVT.
What is MVS doing "behind the scenes" to change the pointers I have set up
to do the I/O under OS? What exactly are you saying MVS does? i.e. does it
allocate the buffers? If so, then they must be pinned when you do the I/O,
or you risk a page fault in the middle of your I/O transfer.

Nick Maclaren

unread,
Oct 22, 2004, 5:50:34 AM10/22/04
to
In article <mD1ed.18910$OD2....@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>I didn't mention reads versus writes. Yes the same, or similar
>considerations apply to both. What are you trying to say that I am missing?

My mistake. I don't find that asterisk notation very natural, and
misread what you were saying. Yes, you are right, at least if you
assume that "user memory" == "virtual addressing".

>> There were also some 'interesting' security
>> exposures here under MVT, cured by removing the facility in MVS.
>>
>> And the problem with allowing it 'just' when I/O is in progress is
>> that is not a bounded timestep. It just about is for local disks,
>> which was traditionally assumed by Unix, but it really, really isn't
>> for sockets ....
>
>So you are saying that the only solution is to have a single large area
>permanently pinned (i.e. a unix style system cache) and move i/o data
>through it?

Not quite ....

What I am saying is that you need a logical interface by which you
can ensure that the various entities (user code, device drivers
etc.) are appropriate isolated. And (and here is the rub) that
this interface remains tight even following error recovery from
pretty nasty errors.

>I'm not sure what you mean here. Each packet has the queue pair number and
>thus requires access to the queue pair to complete. So if you kill the
>queue pair, you stop any further packets. And, of course, at least for I/O
>devices, you can send some sort of abort command (details depending on the
>device, of course).

The kernel sets up a queue pair
The application (sic) fires up a transfer on it
The application dies or is killed
The kernel calls the device to cancel but gets no response

Question: How far through the second step had the device got?

If it had read the queue pair, then reusing the memory runs the
risk of it coming to life and accessing the physical memory that
it describes. If it hadn't, the memory can be reused safely.

Assume the former. The device is hot-swapped out. Now, the kernel
must remember that the blocked memory can be reused. Oh, yes, it
CAN be done, but it is complex and therefore error-prone.

>I never programmed under MVS at that level, but if I had an assembler
>program that ran under OS, I could run that same object module under MVT.
>What is MVS doing "behind the scenes" to change the pointers I have set up
>to do the I/O under OS? What exactly are you saying MVS does? i.e. does it
>allocate the buffers? If so, then they must be pinned when you do the I/O,
>or you risk a page fault in the middle of your I/O transfer.

This was poorly specified, varied a lot with which version of MVS,
and even more with the type of I/O and privilege level of the caller.
I can no longer remember precisely what I knew.

One of the things that I believe that it did was to copy data areas
if it would have involved pinning things it didn't want to pin, or
(where practical) change the page tables to transfer ownership.
I can't remember if it actually DID the latter, but I saw documents
that implied it was at least left as an implementation option.

Note that the reason that it could do all of that is that the contents
of an area being read from disk were undefined, and that most buffers
were indeed allocated by the system. That wasn't required by the
interface, but was done by many or most I/O interfaces.


Regards,
Nick Maclaren.

Eric

unread,
Oct 22, 2004, 12:04:17 PM10/22/04
to
Nick Maclaren wrote:
>
> The kernel sets up a queue pair
> The application (sic) fires up a transfer on it
> The application dies or is killed
> The kernel calls the device to cancel but gets no response
>
> Question: How far through the second step had the device got?
>
> If it had read the queue pair, then reusing the memory runs the
> risk of it coming to life and accessing the physical memory that
> it describes. If it hadn't, the memory can be reused safely.
>
> Assume the former. The device is hot-swapped out. Now, the kernel
> must remember that the blocked memory can be reused. Oh, yes, it
> CAN be done, but it is complex and therefore error-prone.

Exactly how is this different from the failure of any other DMA device?

BTW, in VMS and WNT it should not be complex to deal with
but it would be up to the driver to decide to release memory.
The page frames would remain pinned while the IO Request Packet (IRP)
was outstanding, and it would remain outstanding while the device
driver was holding it. When the device was hot swapped the device
driver would be sent a special "hot swap" control message and it
would simply complete the I/O with an error code and pass it
back to the OS as usual which would decrement the ref count and
release the page frames. NBD.

I imagine that making sure the OS and all devices understand hot
swapping would probably be the larger problem, particularly in WNT
land.

Eric

Nick Maclaren

unread,
Oct 22, 2004, 12:52:48 PM10/22/04
to

In article <41792F81...@sympaticoREMOVE.ca>,

Eric <eric_p...@sympaticoREMOVE.ca> writes:
|> Nick Maclaren wrote:
|> >
|> > The kernel sets up a queue pair
|> > The application (sic) fires up a transfer on it
|> > The application dies or is killed
|> > The kernel calls the device to cancel but gets no response
|> >
|> > Question: How far through the second step had the device got?
|> >
|> > If it had read the queue pair, then reusing the memory runs the
|> > risk of it coming to life and accessing the physical memory that
|> > it describes. If it hadn't, the memory can be reused safely.
|> >
|> > Assume the former. The device is hot-swapped out. Now, the kernel
|> > must remember that the blocked memory can be reused. Oh, yes, it
|> > CAN be done, but it is complex and therefore error-prone.
|>
|> Exactly how is this different from the failure of any other DMA device?

Not at all.

My point was and is the undesirability of more-or-less arbitrary
(user controlled) RDMA in a multi-purpose, shared use operating
system (especially one using virtual addressing) - and the fact
that computing history shows how and why it is a nightmare. No
more and no less.

There is a minor detail that interfaces used for networking are
typically worse than ones used only for local devices, but it is
not a major matter once you start allowing for device failure.

|> I imagine that making sure the OS and all devices understand hot
|> swapping would probably be the larger problem, particularly in WNT
|> land.

Yup.


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Oct 22, 2004, 1:03:28 PM10/22/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:clal5a$csi$1...@gemini.csx.cam.ac.uk...

> In article <mD1ed.18910$OD2....@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>>
>>I didn't mention reads versus writes. Yes the same, or similar
>>considerations apply to both. What are you trying to say that I am
>>missing?
>
> My mistake. I don't find that asterisk notation very natural, and
> misread what you were saying. Yes, you are right, at least if you
> assume that "user memory" == "virtual addressing".

Yes, I was making that assumption. BTW, is there some "official" way to
express emphasis of a word or phrase in an ASCII-only newsgroup posting? I
used to use tildes, that is ~, but that didn't seem to do it, and using
asterisks has the at least small benefit that in Microsoft Word, it renders
the text in bold font, which is really what I want to express. But if there
is some preferred way, I am open to using it.

>>> There were also some 'interesting' security
>>> exposures here under MVT, cured by removing the facility in MVS.
>>>
>>> And the problem with allowing it 'just' when I/O is in progress is
>>> that is not a bounded timestep. It just about is for local disks,
>>> which was traditionally assumed by Unix, but it really, really isn't
>>> for sockets ....
>>
>>So you are saying that the only solution is to have a single large area
>>permanently pinned (i.e. a unix style system cache) and move i/o data
>>through it?
>
> Not quite ....
>
> What I am saying is that you need a logical interface by which you
> can ensure that the various entities (user code, device drivers
> etc.) are appropriately isolated. And (and here is the rub) that
> this interface remains tight even following error recovery from
> pretty nasty errors.

Sure. But that can be done in several ways. But I am not saying it is easy
to get right. :-(

>>I'm not sure what you mean here. Each packet has the queue pair number
>>and
>>thus requires access to the queue pair to complete. So if you kill the
>>queue pair, you stop any further packets. And, of course, at least for
>>I/O
>>devices, you can send some sort of abort command (details depending on the
>>device, of course).
>
> The kernel sets up a queue pair
> The application (sic) fires up a transfer on it
> The application dies or is killed
> The kernel calls the device to cancel but gets no response
>
> Question: How far through the second step had the device got?
>
> If it had read the queue pair, then reusing the memory runs the
> risk of it coming to life and accessing the physical memory that
> it describes. If it hadn't, the memory can be reused safely.
>
> Assume the former. The device is hot-swapped out. Now, the kernel
> must remember that the blocked memory can be reused. Oh, yes, it
> CAN be done, but it is complex and therefore error-prone.

Agreed. It has to be done carefully. But that is true of most device error
recovery.

>>I never programmed under MVS at that level, but if I had an assembler
>>program that ran under OS, I could run that same object module under MVT.
>>What is MVS doing "behind the scenes" to change the pointers I have set up
>>to do the I/O under OS? What exactly are you saying MVS does? i.e. does
>>it
>>allocate the buffers? If so, then they must be pinned when you do the
>>I/O,
>>or you risk a page fault in the middle of your I/O transfer.
>
> This was poorly specified, varied a lot with which version of MVS,
> and even more with the type of I/O and privilege level of the caller.
> I can no longer remember precisely what I knew.

OK. Perhaps someone else who went through this can comment.

Nick Maclaren

unread,
Oct 22, 2004, 1:22:19 PM10/22/04
to

In article <A3bed.19729$OD2....@bgtnsc05-news.ops.worldnet.att.net>,

"Stephen Fuld" <s.f...@PleaseRemove.att.net> writes:
|>
|> Yes, I was making that assumption. BTW, is there some "official" way to
|> express emphasis of a word or phrase in an ASCII-only newsgroup posting? I
|> used to use tildes, that is ~, but that didn't seem to do it, and using
|> asterisks has at least the small benefit that in Microsoft Word, it renders
|> the text in bold font, which is really what I want to express. But if there
|> is some preferred way, I am open to using it.

I use capitals, which is one of the pre-computer conventions, but
there are many others. I have been told that I should convert to
half a dozen (different) 'official' versions :-)

|> > Assume the former. The device is hot-swapped out. Now, the kernel
|> > must remember that the blocked memory can be reused. Oh, yes, it
|> > CAN be done, but it is complex and therefore error-prone.
|>
|> Agreed. It has to be done carefully. But that is true of most device error
|> recovery.

With 'simple' error recovery, you need only to deal with the device
and the memory that it 'owns'. I am assuming that, when it went
AWOL, you broke all connexions from processes to the device. With
user-specifiable RDMA, you have to deal with arbitrary memory that
is in an indeterminate state.

Consider writing a section of a shared library to disk[*]. While
things are working, each process using it bumps the library's use
count by one. When a process is killed with RDMA outstanding, that
use count must NOT be reduced, but its ownership must be transferred
to the device. Even that isn't right, if the library goes out of
use while the device is still AWOL, as you THEN want to free all
pages that aren't tainted and leave the ones that are.

So you have to ensure that the error recovery interacts with a
large number of system components, and it is interactions that cause
hard-to-locate problems far more than single-component bugs.

[*] Yes, I KNOW it's insane, but you have to handle anything that's
allowed.


Regards,
Nick Maclaren.

Greg Lindahl

unread,
Oct 22, 2004, 2:29:20 PM10/22/04
to
In article <GKYdd.5175$ta5....@newsread3.news.atl.earthlink.net>,
TC <no...@noplace.com> wrote:

> However, just to make it
> confusing, there is a small, seperate CRC contained within the header to
> cover specific bits in the header.

How can that be confusing, when it's a normal feature in many
protocols?

>Also, packet cut-through is supported. This means that there is a method for
>killing a corrupt packet that is forwarded, or a packet that is forwarded
>incorrectly.

Wow, now that's a version of "cut-through" that I've never heard of
before. Are you sure that's what it's called?

-- g

ly...@garlic.com

unread,
Oct 22, 2004, 6:21:36 PM10/22/04
to
"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message news:<mD1ed.18910$OD2....@bgtnsc05-news.ops.worldnet.att.net>...

> I never programmed under MVS at that level, but if I had an assembler
> program that ran under OS, I could run that same object module under MVT.
> What is MVS doing "behind the scenes" to change the pointers I have set up
> to do the I/O under OS? What exactly are you saying MVS does? i.e. does it
> allocate the buffers? If so, then they must be pinned when you do the I/O,
> or you risk a page fault in the middle of your I/O transfer.

In the os/360 paradigm ... the application code ... typically actually
some (file access) library routine ... running in the application
region ... created an (I/O) channel program (sequence of CCWs). Then it
would execute a supervisor/kernel (excp) call. The kernel would do some
prelim ... like, if it was a disk request, prefix the channel program
with an arm-positioning operation ... and then directly invoke the
application region I/O channel program.

In the initial move of MVT to virtual memory ... it was called OS/VS2
SVS ... single virtual storage ... it was as if MVT had 16mbytes of real
storage ... with some underlying stub-code that mapped the MVT 16mbytes
(single virtual address space) to typically much smaller real storage.

The initial prototype for OS/VS2 SVS involved taking MVT, crafting the
stub virtual address space code on the side and borrowing "CCWTRANS"
from CP/67. The issue is that the channel program CCWs all use real
addresses for transfers. The problem is that the application code
generating the CCW sequence still believes it is generating real
addresses in its channel program CCWs ... when they are all actually
virtual addresses. Now when the application program issued the
(EXCP) kernel call ... instead of directly pointing at the application
channel program code ... the code called the (CP/67) CCWTRANS routine.
This routine created a "shadow" copy of the user's channel program
CCWs ... checked each of the virtual addresses ... as appropriate made
sure the associated virtual page was resident & pinned in real storage,
and translated the virtual address (from the application channel program
CCWs) to the appropriate real address (in the "shadow" channel program
CCWs). The actual I/O that was initiated was the "translated" shadow
channel program CCWs ... no longer the original application channel
program CCWs (a major issue was that real I/O is done with real
addresses, and applications only had virtual addresses to deal with).
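
A highly simplified C sketch of that shadow-channel-program step (the
structures and the pin_and_translate() helper are hypothetical stand-ins,
nothing like the real CP/67 CCWTRANS): copy each CCW, pin the virtual page
it names, and substitute the real address before the I/O is started.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define PAGE 4096u

/* Hypothetical, highly simplified CCW: a command code, a data address and
   a byte count.  Real S/360 CCWs are 8 bytes with flags, chaining, etc. */
struct ccw {
    uint8_t  cmd;
    uint32_t addr;   /* virtual in the application's copy, real in the shadow */
    uint16_t count;
};

/* Stand-in for the paging supervisor: pretend every virtual page is already
   resident and pinned, mapped at a fixed offset.  The real service would
   fault the page in, pin it, and look up the real frame. */
static uint32_t pin_and_translate(uint32_t vaddr)
{
    uint32_t vpage = vaddr & ~(PAGE - 1);
    uint32_t off   = vaddr &  (PAGE - 1);
    return (vpage + 0x00100000u) + off;       /* fake frame assignment */
}

/* The CCWTRANS-like step: build a shadow channel program with real
   addresses; the application's virtual-address CCWs are never handed to
   the channel.  (A real translator must also split any CCW whose buffer
   crosses a page boundary into data-chained CCWs, since the underlying
   real frames need not be contiguous.) */
static void build_shadow(const struct ccw *user, struct ccw *shadow, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        shadow[i]      = user[i];
        shadow[i].addr = pin_and_translate(user[i].addr);
    }
}

int main(void)
{
    struct ccw prog[2] = { { 0x02, 0x00020010u, 80 },     /* read  */
                           { 0x01, 0x00021000u, 80 } };   /* write */
    struct ccw shadow[2];

    build_shadow(prog, shadow, 2);
    for (int i = 0; i < 2; i++)
        printf("ccw %d: virtual %08x -> real %08x\n",
               i, (unsigned)prog[i].addr, (unsigned)shadow[i].addr);
    return 0;
}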

This SVS system ... looked like a single virtual address space ...
with the old MVT kernel and all applications/tasks occupying that
single virtual address space. The transition from SVS (single virtual
storage) to MVS (multiple virtual storage) ... was effectively giving
each application its own virtual address space. This structure
actually had the (same) MVS kernel (image) occupying 8mbytes of every
application virtual address space ... with 7mbytes (of the 16mbytes
address space) available for an application.

There is one mbyte missing. The problem was that in MVT and SVS ...
everything occupied the same address space ... and there was heavy
reliance of pointer passing paradigm. This included numerous
"sub-system" function that were used by applications ... but were
actually part of the kernel. Come MVS ... the application would be
making a call passing a pointer to some application data .... which
would eventually pass thru the kernel and then into a completely
different address space (where the subsystem function was working).
The problem now was that the pointer ... was to an area in a totally
different address space. A work around was created called the (1mbyte)
common segment ... that appears in all virtual address spaces ...
where data could be stuffed away ... and pointers passed ... and they
would be useable ... regardless of which virtual address space was
executing.

The next problem was as MVS systems grew and got more complex ...
there were more and more subsystems that required common segment
space. Very quickly, some installations found themselves with 4mbyte
common (segment) areas .... leaving only a maximum of 4mbytes (out of
16mbytes) in each virtual address space for application execution.

Note that the same requirement continued in MVS for translating the
application channel program CCWs into the real executing CCWs ... i.e.
the virtual address space channel program CCWs had to be copied to
shadow CCWs and the virtual addresses translated to real addresses
before starting the I/O operation.

There were some subsystems that were given V=R regions .... where
memory regions were mapped to real storage and the application
subsystem code generated channel program CCWs that had real addresses
pointing to areas that had fixed real storage allocation. These
channel program CCWs could be treated specially and not have to be
translated ... but executed directly (like things were back on real
memory MVT systems).

Note dual-address space was introduced in the 3033 .... because the
problem with the common (segment) area was becoming so severe ... aka
some installations might soon not have any virtual address space left
to actually run applications. With dual-address space .... a subsystem
would be entered with a secondary address space control register ...
set to the original application program. It then had special
instructions that would use an address pointer to fetch/store data from
the secondary (application) virtual address space ... rather than the
primary (subsystem) virtual address space.

Then came generalized access registers and program calls. The
original os/360 characteristic had lots of calls to various library
functions just by loading a register pointer and "branch and link" to
the routine. Later releases of MVS started moving various of this
stuff into their own address space. You could do a kernel call to
effect an address space switch .... to get to the target library code
... but the kernel call represented a very large pathlength increase
(compared to a BALR instruction). The solution was access
registers and the program call instruction. This is basically a
(protected) table of callable routines set up for an application. The
application can specify an entry in the table and do a program call
instruction. The hardware uses information in the protected program
call table to swizzle the virtual address space control registers and
pass control to the called routine (w/o the overhead of a kernel
call).

random past refs to dual-address space, access registers, program
call, etc
http://www.garlic.com/~lynn/98.html#36 What is MVS/ESA?
http://www.garlic.com/~lynn/2000c.html#84 Is a VAX a mainframe?
http://www.garlic.com/~lynn/2000d.html#28 RS/6000 vs. System/390
architecture?
http://www.garlic.com/~lynn/2000e.html#58 Why not an IBM zSeries
workstation?
http://www.garlic.com/~lynn/2001d.html#28 Very CISC Instuctions (Was:
why the machine word size ...)
http://www.garlic.com/~lynn/2001d.html#30 Very CISC Instuctions (Was:
why the machine word size ...)
http://www.garlic.com/~lynn/2001h.html#73 Most complex instructions
http://www.garlic.com/~lynn/2001i.html#13 GETMAIN R/RU (was: An IEABRC
Adventure)
http://www.garlic.com/~lynn/2001k.html#16 Minimalist design (was Re:
Parity - why even or odd)
http://www.garlic.com/~lynn/2002d.html#51 Hardest Mistake in Comp Arch
to Fix
http://www.garlic.com/~lynn/2002g.html#5 Black magic in POWER5
http://www.garlic.com/~lynn/2002g.html#17 Black magic in POWER5
http://www.garlic.com/~lynn/2002g.html#18 Black magic in POWER5
http://www.garlic.com/~lynn/2002h.html#21 PowerPC Mainframe
http://www.garlic.com/~lynn/2002l.html#51 Handling variable page
sizes?
http://www.garlic.com/~lynn/2002l.html#57 Handling variable page
sizes?
http://www.garlic.com/~lynn/2002n.html#58 IBM S/370-168, 195, and 3033
http://www.garlic.com/~lynn/2002n.html#74 Everything you wanted to
know about z900 from IBM
http://www.garlic.com/~lynn/2002p.html#43 cost of crossing kernel/user
boundary
http://www.garlic.com/~lynn/2002q.html#1 Linux paging
http://www.garlic.com/~lynn/2003c.html#13 Unused address bits
http://www.garlic.com/~lynn/2003d.html#53 Reviving Multics
http://www.garlic.com/~lynn/2003d.html#69 unix
http://www.garlic.com/~lynn/2003e.html#0 Resolved: There Are No
Programs With >32 Bits of Text
http://www.garlic.com/~lynn/2003e.html#12 Resolved: There Are No
Programs With >32 Bits of Text
http://www.garlic.com/~lynn/2003g.html#13 Page Table - per OS/Process
http://www.garlic.com/~lynn/2003m.html#29 SR 15,15
http://www.garlic.com/~lynn/2004c.html#6 If the x86 ISA could be
redone
http://www.garlic.com/~lynn/2004e.html#41 Infiniband - practicalities
for small clusters
http://www.garlic.com/~lynn/2004f.html#27 [Meta] Marketplace argument
http://www.garlic.com/~lynn/2004f.html#53 Infiniband - practicalities
for small clusters

random past refs to svs and/or AOS (original SVS prototype):
http://www.garlic.com/~lynn/93.html#18 location 50
http://www.garlic.com/~lynn/94.html#4 Schedulers
http://www.garlic.com/~lynn/94.html#7 IBM 7090 (360s, 370s, apl, etc)
http://www.garlic.com/~lynn/94.html#20 CP/67 & OS MFT14
http://www.garlic.com/~lynn/94.html#35 mainframe CKD disks & PDS files
(looong... warning)
http://www.garlic.com/~lynn/94.html#49 Rethinking Virtual Memory
http://www.garlic.com/~lynn/95.html#2 Why is there only VM/370?
http://www.garlic.com/~lynn/97.html#23 Kernel swapping itself out ?
http://www.garlic.com/~lynn/97.html#26 IA64 Self Virtualizable?
http://www.garlic.com/~lynn/98.html#11 S/360 operating systems
geneaology
http://www.garlic.com/~lynn/98.html#12 S/360 operating systems
geneaology
http://www.garlic.com/~lynn/98.html#28 Drive letters
http://www.garlic.com/~lynn/99.html#7 IBM S/360
http://www.garlic.com/~lynn/99.html#204 Core (word usage) was
anti-equipment etc
http://www.garlic.com/~lynn/99.html#209 Core (word usage) was
anti-equipment etc
http://www.garlic.com/~lynn/2000.html#68 Mainframe operating systems
http://www.garlic.com/~lynn/2000b.html#61 VM (not VMS or Virtual
Machine, the IBM sort)
http://www.garlic.com/~lynn/2000c.html#34 What level of computer is
needed for a computer to Love?
http://www.garlic.com/~lynn/2000c.html#35 What level of computer is
needed for a computer to Love?
http://www.garlic.com/~lynn/2000e.html#37 FW: NEW IBM MAINFRAMES / OS
/ ETC.(HOT OFF THE PRESS)
http://www.garlic.com/~lynn/2000f.html#35 Why IBM use 31 bit
addressing not 32 bit?
http://www.garlic.com/~lynn/2000g.html#27 Could CDR-coding be on the
way back?
http://www.garlic.com/~lynn/2001.html#63 Are the L1 and L2 caches
flushed on a page fault ?
http://www.garlic.com/~lynn/2001i.html#37 IBM OS Timeline?
http://www.garlic.com/~lynn/2001i.html#38 IBM OS Timeline?
http://www.garlic.com/~lynn/2001j.html#4 I hate Compaq
http://www.garlic.com/~lynn/2001k.html#37 Is anybody out there still
writting BAL 370.
http://www.garlic.com/~lynn/2001l.html#36 History
http://www.garlic.com/~lynn/2001l.html#38 is this correct ? OS/360
became MVS and MVS >> OS/390
http://www.garlic.com/~lynn/2001l.html#40 MVS History (all parts)
http://www.garlic.com/~lynn/2002c.html#52 Swapper was Re: History of
Login Names
http://www.garlic.com/~lynn/2002j.html#70 hone acronym (cross post)
http://www.garlic.com/~lynn/2002l.html#51 Handling variable page
sizes?
http://www.garlic.com/~lynn/2002l.html#65 The problem with installable
operating systems
http://www.garlic.com/~lynn/2002l.html#67 The problem with installable
operating systems
http://www.garlic.com/~lynn/2002n.html#58 IBM S/370-168, 195, and 3033
http://www.garlic.com/~lynn/2002n.html#62 PLX
http://www.garlic.com/~lynn/2003.html#51 Top Gun
http://www.garlic.com/~lynn/2003b.html#0 Disk drives as commodities.
Was Re: Yamhill
http://www.garlic.com/~lynn/2003b.html#22 360/370 disk drives
http://www.garlic.com/~lynn/2003d.html#69 unix
http://www.garlic.com/~lynn/2003f.html#13 Alpha performance, why?
http://www.garlic.com/~lynn/2003g.html#13 Page Table - per OS/Process
http://www.garlic.com/~lynn/2003g.html#14 Page Table - per OS/Process
http://www.garlic.com/~lynn/2003h.html#35 UNIX on LINUX on VM/ESA or
z/VM
http://www.garlic.com/~lynn/2003k.html#27 Microkernels are not "all or
nothing". Re: Multics Concepts For
http://www.garlic.com/~lynn/2004.html#13 Holee shit! 30 years ago!
http://www.garlic.com/~lynn/2004.html#18 virtual-machine theory
http://www.garlic.com/~lynn/2004b.html#60 Paging
http://www.garlic.com/~lynn/2004c.html#6 If the x86 ISA could be
redone
http://www.garlic.com/~lynn/2004c.html#59 real multi-tasking,
multi-programming
http://www.garlic.com/~lynn/2004d.html#63 System/360 40 years old
today
http://www.garlic.com/~lynn/2004e.html#35 The attack of the killer
mainframes
http://www.garlic.com/~lynn/2004e.html#40 Infiniband - practicalities
for small clusters
http://www.garlic.com/~lynn/2004f.html#51 before execution does it
require whole program 2 b loaded in
http://www.garlic.com/~lynn/2004f.html#60 Infiniband - practicalities
for small clusters

TC

unread,
Oct 22, 2004, 11:20:06 PM10/22/04
to
Well... maybe my post wasn't clear. By cut-through I mean cut-through
routing. Specifically, the ability to begin forwarding a packet before the
entire packet has been received and validated. As a consequence, the
received packet may subsequently be found to be corrupt. If it is, the
forwarded packet needs to be stomped (killed).

Hopefully this is a little clearer, but maybe you weren't confused in the
first place.
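
For what it's worth, a toy model of that behaviour in C; the CRC-32 and
the stomp constant here are generic placeholders, not any particular
link protocol's definitions:

#include <stdint.h>
#include <stddef.h>

#define STOMP 0xDEADBEEFu                  /* known "bad" pattern for a stomped CRC */

static uint32_t crc32_update(uint32_t c, uint8_t b)
{
    c ^= b;
    for (int k = 0; k < 8; k++)
        c = (c >> 1) ^ (0xEDB88320u & -(c & 1));
    return c;
}

/* in[]  holds n payload bytes followed by a 4-byte trailing CRC;
 * out[] receives what gets forwarded downstream (n + 4 bytes).     */
size_t cut_through_forward(const uint8_t *in, size_t n, uint8_t *out)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t   o = 0;

    for (size_t i = 0; i < n; i++) {       /* forward before full validation */
        crc = crc32_update(crc, in[i]);
        out[o++] = in[i];
    }
    crc = ~crc;                            /* what the trailing CRC should be */

    uint32_t rx_crc = 0;
    for (int i = 0; i < 4; i++)            /* trailing CRC as received */
        rx_crc = (rx_crc << 8) | in[n + i];

    /* good packet: pass the CRC through; corrupt packet: it has already
     * been forwarded, so stomp the CRC and let the next hop discard it  */
    uint32_t tx_crc = (rx_crc == crc) ? rx_crc : (crc ^ STOMP);
    for (int i = 3; i >= 0; i--)
        out[o++] = (uint8_t)(tx_crc >> (8 * i));
    return o;
}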

TC

"Greg Lindahl" <lin...@pbm.com> wrote in message
news:41795180$1...@news.meer.net...

Greg Lindahl

unread,
Oct 22, 2004, 11:44:17 PM10/22/04
to
In article <G5ked.5219$5i5....@newsread2.news.atl.earthlink.net>,
TC <no...@noplace.com> wrote:

>Well... maybe my post wasn't clear. By cut-through I mean cut-through
>routing. Specifically, the ability to begin forwarding a packet before the
>entire packet has been received and validated. As a consequence, the
>received packet may subsequently be found to be corrupt. If it is, the
>forwarded packet needs to be stomped (killed).

Ah. That is the usual definition of cut-through. This error issue is
again the usual way of dealing with errors in interconnects, just like
the header checksum. I guess you call it confusing because PCI doesn't
work that way, not that it's any different from other modern
packet-based protocols.

BTW, I'd appreciate it if you wouldn't top post. Thanks. It's doubly
confusing when you top-post on top of a thread of bottom-posted
messages, like you just did.

-- greg

Stephen Fuld

unread,
Oct 23, 2004, 12:07:30 PM10/23/04
to

<ly...@garlic.com> wrote in message
news:71cb9254.04102...@posting.google.com...

> "Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
> news:<mD1ed.18910$OD2....@bgtnsc05-news.ops.worldnet.att.net>...
>> I never programmed under MVS at that level, but if I had an assembler
>> program that ran under OS, I could run that same object module under MVT.
>> What is MVS doing "behind the scenes" to change the pointers I have set up
>> to do the I/O under OS? What exactly are you saying MVS does? i.e. does it
>> allocate the buffers? If so, then they must be pinned when you do the I/O,
>> or you risk a page fault in the middle of your I/O transfer.

snipped most of a good description of the changes required for
implementing virtual memory.

> This routine
> created a "shadow" copy of the user's channel program CCWs ....
> checked each of the virtual addresses ... as appropriate made sure the
> associated virtual page was resident & pinned in real storage and
> translated the virtual address (from the application channel program
> CCWs) to the appropriate real address (in the "shadow" channel program
> CCWs).

The key words to our discussion here are that the kernel made sure the pages
were "resident and pinned". That was necessary because otherwise many
problems could ensue due to the buffers being in pages IN USER SPACE (I did
the emphasis as you suggested, Nick, but it still doesn't feel right. I
meant to emphasize, not to "shout". :-( ) Thus the buffers were pinned
during the I/O, and the primary difference when using RDMA is that the
buffers have to be pinned even when there is no I/O going on. This is the
price one has to pay for not paying the overhead of having the OS know about
the I/O (so it can pin the buffers). Yes, it hinders defragmentation of
real memory, and reduces effective memory size, since, on average, more
memory is taken up by pinned pages. ISTM that is the tradeoff and one might
want to make it or not, but it is not an obviously stupid one to make.
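
To make the tradeoff concrete, a minimal sketch using POSIX
mlock()/munlock(); the rdma_register()/rdma_deregister() calls in the
comments are hypothetical stand-ins for whatever the particular
interconnect API provides:

#include <sys/mman.h>    /* mlock, munlock */
#include <stdlib.h>

/* Connection-lifetime pinning: the buffer stays resident in real memory
 * for as long as the connection exists, not just during one transfer. */
void *setup_rdma_buffer(size_t len)
{
    void *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    if (mlock(buf, len) != 0) {            /* pin: pages can no longer be paged out */
        free(buf);
        return NULL;
    }
    /* rdma_register(buf, len);   hand the mapping to the adapter (hypothetical) */
    return buf;
}

void teardown_rdma_buffer(void *buf, size_t len)
{
    /* rdma_deregister(buf, len);  (hypothetical) */
    munlock(buf, len);                     /* only now can the pages move again */
    free(buf);
}

With ordinary kernel-mediated I/O the same pin would instead be taken
and released around each individual transfer.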

Thanks Lynn.

Nick, is this what you were referring to? Or is there something else here.

Nick Maclaren

unread,
Oct 23, 2004, 12:21:06 PM10/23/04
to
In article <6lved.21678$OD2....@bgtnsc05-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>The key words to our discussion here is that the kernel made sure the pages
>were "resident and pinned". That was necessary because otherwise many
>problems could insue due to the buffers being in pages IN USER SPACE (I did
>the emphasis as you suggested, Nick, but it still doesn't feel right. I
>meant to emphasize, not to "shout". :-( )

Well, we could try underlining :-) I don't think that there is any
convention agreeable to all.

I am with you so far.

> Thus the buffers were pinned
>during the I/O and the primary difference when using RDMA is that the
>buffers have to be pinned even when there is no I/O going on. This is the
>price one has to pay for not paying the overhead of having the OS know about
>(so it can pin the buffers) the I/O. Yes, it hinders defragmentation of
>real memory, and reduces effective memory size, since, on average, more
>memory is taken up by pinned pages. ISTM that is the tradeoff and one might
>want to make it or not, but it is not an obviously stupid one to make.

Now, there I disagree.

There IS a difference between pages being resident and pinned for the
life of the connexion and being pinned just for the life of the
transfer, and it is an important one. But there is a VASTLY more
important difference, which is the one I was referring to.

Traditional Unix I/O got a lot of its simplicity by COPYING the data
between user space and kernel buffers and back again. That meant that
no part of user space was EVER pinned (except when the single-threaded
kernel was actually operating on it). This was dead easy to implement
and validate, and a user process was swappable at any time it wasn't
actually executing (in itself or in a kernel thread called from it).
No ifs, buts, maybes or anything else.

Any form of DMA to user space breaks the latter. Connexion-lifetime
RDMA increases the visibility of the problems caused by a DMA transfer
having trouble, and means that they can arise even when the process
has no outstanding I/O, but it doesn't introduce any new ones.
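
A toy model of that copying path, with the kernel buffer and the DMA
faked in user space; nothing here is any particular kernel's API:

#include <string.h>
#include <stdio.h>

static char kbuf[4096];                          /* permanently resident kernel buffer */

static size_t fake_device_dma(char *dst, size_t len)
{
    const char data[] = "data arriving from the device";
    size_t n = sizeof data < len ? sizeof data : len;
    memcpy(dst, data, n);                        /* "DMA" only ever targets kbuf */
    return n;
}

static void copy_to_user_model(void *user, const void *kern, size_t n)
{
    memcpy(user, kern, n);                       /* the extra copy that buys simplicity */
}

int main(void)
{
    char user_buf[128];                          /* ordinary, pageable user memory */
    size_t got = fake_device_dma(kbuf, sizeof kbuf);
    copy_to_user_model(user_buf, kbuf, got);
    printf("read %zu bytes: %s\n", got, user_buf);
    return 0;
}

Because the device never targets a user page, the user page never has
to stay pinned across the I/O.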


Regards,
Nick Maclaren.

Peter Boyle

unread,
Oct 23, 2004, 12:47:40 PM10/23/04
to

On Sat, 23 Oct 2004, Nick Maclaren wrote:


> In article <6lved.21678$OD2....@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
> >
> >The key words to our discussion here are that the kernel made sure the pages
> >were "resident and pinned". That was necessary because otherwise many
> >problems could ensue due to the buffers being in pages IN USER SPACE (I did
> >the emphasis as you suggested, Nick, but it still doesn't feel right. I
> >meant to emphasize, not to "shout". :-( )
>
> Well, we could try underlining :-) I don't think that there is any
> conventional agreeable to all.

FWIW, my opinion is *in user space* is not quite so loud as
IN USER SPACE.

Peter

Patrick Geoffray

unread,
Oct 23, 2004, 3:29:29 PM10/23/04
to Stephen Fuld
Stephen Fuld wrote:
>>InfiniBand makes this particularly
>>obtrusive by (effectively) keeping its own copy of the virtual
>>to physical mapping and ownership in the device.
>
>
> Well, the device keeps a copy of the address of a queue pair descriptor,
> not directly the address into which it is going to do the transfer. The
> queue descriptor has the actual transfer address. That is an important
> distinction, as it is a way of fixing one of the problems you discuss below.

AFAIK, the current implementations of IB keep the physical addresses of
a registered user memory region in the NIC (a lot of DRAM). The NIC
fetches the QP descriptor from a queue (so the NIC has the physical
addresses describing this queue), but that's just for the control. For
the data, i.e. for RDMA, the physical addresses are on the NIC (pushed at
registration time or cached from a previous usage). When you deregister
memory, you have to invalidate these physical addresses in the NIC, and
it's not cheap.

Patrick

Patrick Geoffray

unread,
Oct 23, 2004, 3:51:00 PM10/23/04
to Nick Maclaren
Nick Maclaren wrote:
> My point was and is the undesirability of more-or-less arbitrary
> (user controlled) RDMA in a multi-purpose, shared use operating
> system (especially one using virtual addressing) - and the fact
> that computing history shows how and why it is a nightmare. No
> more and no less.

You are absolutely right.

A practical example is IOMMU thrashing: leaving pages pinned in the
host for a long time means that the related entries in the IOMMU
(sitting between the memory bus and the IO bus) are not released. As the
size of the IOMMU is usually very limited (because the techies thought
of it as being used only by DMA operations in progress), you may end up
with one process pinning too much memory and preventing anybody else
from pinning memory. On PPC64 for example, the IOMMU typically covers
only 256 MB. The solution is to limit the amount of memory that a
process can register, but that forces early deregistration and it costs
more than doing a copy to a bounce buffer.
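
A sketch of that tradeoff, with the 256 MB figure used as an assumed
per-process budget and register_region()/rdma_write() left as
hypothetical comments:

#include <string.h>
#include <stddef.h>

#define IOMMU_BUDGET (256UL << 20)     /* assumed per-process registration limit */
#define BOUNCE_SIZE  (64UL << 10)

static size_t pinned_bytes;            /* bytes currently registered/pinned */
static char   bounce[BOUNCE_SIZE];     /* registered once, reused for copies */

int send_buffer(const char *buf, size_t len)
{
    if (pinned_bytes + len <= IOMMU_BUDGET) {
        pinned_bytes += len;
        /* register_region(buf, len);  rdma_write(buf, len);   zero-copy path */
        return 0;
    }
    /* over budget: copy through the pre-registered bounce buffer instead,
     * paying memory bandwidth but consuming no new IOMMU entries          */
    for (size_t off = 0; off < len; off += BOUNCE_SIZE) {
        size_t chunk = (len - off < BOUNCE_SIZE) ? len - off : BOUNCE_SIZE;
        memcpy(bounce, buf + off, chunk);
        /* rdma_write(bounce, chunk); */
    }
    return 0;
}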

There are other side effects to applications registering memory: on
Solaris for example, the OS would wait for all of the pinned pages to be
unpinned before terminating a process, and it would do that before
closing the file descriptors of the process. It makes sense to wait for
the pages to be unpinned because, in the mind of the Solaris folks,
they are supposed to be pinned just for the time of the DMA operations.
Unfortunately, when you kill a process, the driver would do the
cleanup when its file descriptor is closed, but the OS does not close
the file descriptor until the pages are unpinned. So, kill a process
(just with CTRL+C) and go for the reset button...

The only way to do RDMA safely and efficiently would be for the OS to
keep a large IOMMU in sync with its own page table and use virtual
addresses on the IO devices. No pinning. Device independent.

Patrick

Kai Harrekilde-Petersen

unread,
Oct 23, 2004, 7:51:16 PM10/23/04
to
iai...@truecircuits.com (Iain McClatchie) writes:

> The invertible hash function is good for other things too. Sometimes
> a router will need to rewrite some of the fields in the header. I know
> they rewrite the level-2 destination address, I think they append extra
> tags for ATM-like routing (some sort of Cisco invention), and I suspect
> there are plenty of other changes too, not even getting into stuff like
> NAT.

An IP router will, for unicast packets, update the Layer 2 source
and destination addresses, decrement the TTL, and update the checksum.
For multicast, the Layer-2 destination address is not changed (at
least not for Ethernet. I don't know for other protocols).

There may be other updates/changes, like VLAN tagging/untagging and
DSCP codepoint changes, but those are unrelated to the IP routing in
itself.
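
The TTL decrement plus checksum update can be done incrementally,
roughly along the lines of the well-known RFC 1141 trick (ignoring the
all-ones corner case that RFC 1624 later tightened up):

#include <stdint.h>

/* hdr points at an IPv4 header; caller has already checked that TTL > 1.
 * TTL is byte 8 and the header checksum is bytes 10-11 (network order). */
void decrement_ttl_and_fix_checksum(uint8_t *hdr)
{
    hdr[8]--;                                  /* decrement TTL */

    /* TTL sits in the high byte of its 16-bit word, so the word dropped
     * by 0x0100; add 0x0100 back into the stored one's-complement value
     * and fold the carry around.                                        */
    uint32_t sum = (uint32_t)((hdr[10] << 8) | hdr[11]) + 0x0100;
    sum = (sum & 0xffff) + (sum >> 16);

    hdr[10] = (uint8_t)(sum >> 8);
    hdr[11] = (uint8_t)sum;
}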

> Now what I don't know, is: can a router figure out from just the header
> bits whether the extra two bytes have been allocated at the end of the
> packet, in which case it can do fast path routing? I suppose it can,
> but I suspect it has to know how to parse the tags for the packet payload
> in order to see how long the payload thinks it is, versus how many bytes
> are being sent. So this extra-big wrapper idea isn't completely general.

Most routers cannot do "fast-path" processing of IP packets with options.

> That said, if I were designing a packet system, I'd have one checksum at
> the end of the packet, and have it cover all the bits in the packet and
> header.

IIRC the SGI ccNUMA packet design used exactly that approach. I
remember reading an article on the "Spider" router chip, where this
was described.


Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>

Kai Harrekilde-Petersen

unread,
Oct 23, 2004, 8:07:12 PM10/23/04
to
Joachim Worringen <see.signatur...@ccrl-nece.de> writes:

> Del Cecchi in <2t5e0lF...@uni-berlin.de>:
>> not a cache coherent thing, more like serial PCI. So it isn't a NUMA
>> appropriate interconnect. Aren't all NUMA interconnects proprietary? I
>> can't think of one that isn't. (well, SCI isn't, but I don't know anyone
>> using it anymore without modifications)(but I could easily be mistaken)
>
> AFAIK, the Dolphin PCI-SCI adapters talk IEEE-SCI (w/o coherence, of course)
> [see http://www.dolphinics.com]. But this is an I/O interconnect.

They are both an I/O interconnect and a NUMA interconnect. There are
two distinctly different uses of the PCI-SCI Bridge (or just "PSB"):
as a "PCI-to-PCI" network, and as an I/O expansion controller.

Most of the differences are really in how you look at the world, and
what your starting and ending points are.

For the NUMA applications, the starting and ending points are two
CPU/memory nodes, while in the I/O expansion scenario, the endpoint is
a PCI bus itself. In order to support the I/O expansion mode, the PSB
includes the capability of doing PCI Config cycles (but the board must
be laid out specifically for this use).

But back to the modifications: Just before I left Dolphin there was a
push internally to add support for 128 byte SCI packets (only 1-16B,
64B and 256B packet sizes are defined by the SCI standard) in order
to increase the effective utilization of the ringlets. The PSB team
(including me) was pushing for adding the 128B packet mode; the
LinkController team did not want to include it because the packet
buffers in the LC2 were on the critical path of that design, and making
them bigger would slow down the LC.

The end result was that the PSB team added support for 128B packets to
either the PSB64 or the PSB66 (I forget which one), and we would then
have to wait to see if and when the LinkController added support for
it too.

ly...@garlic.com

unread,
Oct 23, 2004, 8:29:00 PM10/23/04
to
"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message news:<6lved.21678$OD2....@bgtnsc05-news.ops.worldnet.att.net>...

> The key words to our discussion here are that the kernel made sure the pages
> were "resident and pinned". That was necessary because otherwise many
> problems could ensue due to the buffers being in pages IN USER SPACE (I did
> the emphasis as you suggested, Nick, but it still doesn't feel right. I
> meant to emphasize, not to "shout". :-( ) Thus the buffers were pinned
> during the I/O, and the primary difference when using RDMA is that the
> buffers have to be pinned even when there is no I/O going on. This is the
> price one has to pay for not paying the overhead of having the OS know about
> the I/O (so it can pin the buffers). Yes, it hinders defragmentation of
> real memory, and reduces effective memory size, since, on average, more
> memory is taken up by pinned pages. ISTM that is the tradeoff and one might
> want to make it or not, but it is not an obviously stupid one to make.
>
> Thanks Lynn.
>
> Nick, is this what you were referring to? Or is there something else here.

so there are (at least) two possible gotchas in the model ... one is
that the pages are pinned and the operation is then scheduled ... and
then the pages remain pinned until after the whole operation has
signaled final completion. another is that on read/input ... the read
operation might specify the maximum possible input size (requiring all
possible associated virtual pages to be pinned) ... even when the
actual input turns out to be much less than the maximum.

oldtime scenario might involve single channel program that would read
(or write) a full 3380 cylinder (say 15 tracks times about 40kbytes
.... on the order of 150 4k pages).

del cecchi

unread,
Oct 23, 2004, 11:03:21 PM10/23/04
to

"Kai Harrekilde-Petersen" <k...@harrekilde.dk> wrote in message
news:ufz456...@harrekilde.dk...
The only problem with the single CRC approach comes if the header gets
modified as part of the routing. I think that IB has the source figure
out the routing. Then each switch chops off its part of the routing
from the packet. RIO on the other hand left it up to the "agent"
(switch) to figure out. If the routing data gets modified then the CRC
that covers it needs to be modified as well. So IB has two CRCs.
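
A conceptual sketch of that split; the crc32() here is a generic
placeholder (not IB's actual ICRC/VCRC generators) and the packet
layout is invented for illustration:

#include <stdint.h>
#include <stddef.h>
#include <string.h>

static uint32_t crc32(const uint8_t *p, size_t n)
{
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        c ^= p[i];
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & -(c & 1));
    }
    return ~c;
}

struct packet {
    uint8_t  route[8];        /* rewritten/consumed hop by hop          */
    uint8_t  payload[64];     /* never touched by the switches          */
    uint32_t end_to_end_crc;  /* computed once by the source            */
    uint32_t per_hop_crc;     /* recomputed by every switch on the path */
};

void source_seal(struct packet *pkt)
{
    pkt->end_to_end_crc = crc32(pkt->payload, sizeof pkt->payload);
    pkt->per_hop_crc    = crc32(pkt->route, sizeof pkt->route);
}

void switch_forward(struct packet *pkt)
{
    memmove(pkt->route, pkt->route + 1, sizeof pkt->route - 1);  /* chop off our step */
    pkt->route[sizeof pkt->route - 1] = 0;
    pkt->per_hop_crc = crc32(pkt->route, sizeof pkt->route);     /* must be redone */
    /* end_to_end_crc is left alone, so payload corruption anywhere
     * along the path is still caught at the destination.          */
}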

del cecchi


Stephen Fuld

unread,
Oct 24, 2004, 1:56:59 AM10/24/04
to

"Patrick Geoffray" <pat...@myri.com> wrote in message
news:417AB624...@myri.com...

snip

> The only way to do RDMA safely and efficiently would be for the OS to keep
> a large IOMMU in sync with its own page table and use virtual addresses
> on the IO devices. No pinning. Device independent.

I am missing something in your post. The issues of virtual versus physical
addresses and the requirement for pinning are different issues. The pages
that the I/O transfer is to/from need to be pinned (at least while the I/O
is active) in order to prevent, say, reading data from a page that, because
it wasn't pinned, got overlaid by the kernel with some other page. The
virtual to physical translation has to be done somewhere, but there are
choices about where and when. For example, IIRC IB deals with virtual
addresses as far as the user is concerned, but the translation is done by
the host channel adaptor based on data it got at registration time.

Stephen Fuld

unread,
Oct 24, 2004, 1:57:03 AM10/24/04
to

"Patrick Geoffray" <pat...@myri.com> wrote in message
news:417AB119...@myri.com...

OK, but we are talking device error recovery here, so it is pretty rare.
Expensive when it happens, but perhaps on average not bad.

Stephen Fuld

unread,
Oct 24, 2004, 1:57:01 AM10/24/04
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:cle0di$bia$1...@gemini.csx.cam.ac.uk...

Agreed. I never said there wasn't a difference.


> But there is a VASTLY more important
> difference, which is the one I was referring to.
>
> Traditional Unix I/O got a lot of its simplicity by COPYING the data
> between user space and kernel buffers and back again.

Yes.

> That meant that
> no part of user space was EVER pinned (except when the single-threaded
> kernel was actually operating on it). This was dead easy to implement
> and validate, and a user process was swappable at any time it wasn't
> actually executing (in itself or in a kernel thread called from it).
> No ifs, buts, maybes or anything else.

Well, what about applications that did Unix raw I/O in order to avoid the
buffer copy?

> Any form of DMA to user space breaks the latter. Connexion-lifetime
> RDMA increases the visibility of the problems caused by a DMA transfer
> having trouble, and means that they can arise even when the process
> has no outstanding I/O, but it doesn't introduce any new ones.

Agreed.

jmfb...@aol.com

unread,
Oct 24, 2004, 6:34:11 AM10/24/04
to
In article <cle0di$bia$1...@gemini.csx.cam.ac.uk>,

In TOPS-10 a user mode program could tell the monitor to pin the
job down; we called it being locked in core, and it was done with a LOCK
UUO. Locking in core was treated as an exception, not the default.


>
>Any form of DMA to user space breaks the latter. Connexion-lifetime
>RDMA increases the visibility of the problems caused by a DMA transfer
>having trouble, and means that they can arise even when the process
>has no outstanding I/O, but it doesn't introduce any new ones.

/BAH

Subtract a hundred and four for e-mail.
