
Cray to commercialize Red Storm


Del Cecchi

Oct 27, 2003, 2:03:26 PM
Just saw an announcement that y'all will be able to buy your very own Red
Storm from Cray. This is a machine made up of thousands of Opterons linked
by a fabric. Not sure about the topology.

RM, this building block is similar to what you were asking for, I think.
Runs Linux.

del cecchi


Robert Myers

Oct 27, 2003, 2:46:18 PM
On Mon, 27 Oct 2003 08:03:26 -0600, "Del Cecchi" <cec...@us.ibm.com>
wrote:

http://www.cray.com/news/0310/rsproduct.html

Zowie! Now if we just had a _little_ more information about that
magic fabric!

Can hardly wait to see how things scale.

RM

Del Cecchi

Oct 27, 2003, 3:37:52 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:tlbqpv0hffil2mke2...@4ax.com...

well, a partial quote from theinquirer.net says

"This will use 10,368 Opteron/Sledgehammers and they're interconnected with
Black Widow technology, using a 27 x 16 x 24 mesh, and using 2GHz Opterons.

Cray claims that the network interface will pass data between CPUs at 3GB/s
in each direction.

And this technology will implement the IO subsystem Cray codenames "Black
Widow".

A Red Storm Cray node will have 1GB of SDRAM, use a Cray NIC, have 3GB/s
hypertransport, a PCI-X bridge, and "Black Widow".

Well, that's what Burton Smith told NASA not so long ago, anyway"


Andy Isaacson

Oct 29, 2003, 3:48:52 AM
In article <tlbqpv0hffil2mke2...@4ax.com>,

There's quite a bit more information at
http://www.lanl.gov/orgs/ccn/salishan2003/pdf/camp.pdf
linked from
http://www.theinquirer.net/?article=12387

It's a 3D mesh (not torus).

It doesn't run Linux on the app nodes, it runs Catamount (presumably a
follow-on to the Cougar light-weight kernel that ran on ASCI Red).

Camp refers to "just one really custom part", presumably a router chip.

Still no info about the network design, but it does say that it's source
routed, which I think leaves plain HT out of the running. (Not that HT
scales to that size anyways.) Quotes 4.1 GB/sec bi-directional sustained
bandwidth, 2 microseconds nearest-neighbor latency, 5 us worst-case
latency.

4 Opterons per board, 4 router chips per board, 8 boards per "card cage",
3 cages per rack, for 96 CPUs per rack.

1 GB memory per CPU, 8 GB max memory per CPU.

One can assume the router connects to the memory system as an HT client.

-andy

Greg Lindahl

Oct 29, 2003, 4:25:40 AM
In article <3f9f38a3$0$75887$a186...@newsreader.visi.com>,
Andy Isaacson <a...@pirx.lan> wrote:

>Camp refers to "just one really custom part", presumably a router chip.

It's a catch-all network and south-bridge chip. There are conference
presentations which describe it and the system in much more detail
than you seem to have found.

-- greg

David Wang

Oct 29, 2003, 5:46:34 AM

Hotchips Presentation?

http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf

--
davewang202(at)yahoo(dot)com

Bob Alverson

Oct 29, 2003, 5:44:45 AM
>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:tlbqpv0hffil2mke2...@4ax.com...
>>
>> Zowie! Now if we just had a _little_ more information about that
>> magic fabric!
>>
>> Can hardly wait to see how things scale.
>>
>> RM

You can look at http://www.hotchips.org/archive/hc15/pdf/1.cray.pdf

Bob

Del Cecchi

Oct 29, 2003, 1:54:14 PM

"Andy Isaacson" <a...@pirx.lan> wrote in message
news:3f9f38a3$0$75887$a186...@newsreader.visi.com...

The network is not HyperTransport. I don't know what it is, but it isn't
HT.

del cecchi


Nick Maclaren

Oct 29, 2003, 3:16:23 PM

In article <bnogq6$ube$1...@news.rchland.ibm.com>,

"Del Cecchi" <cec...@us.ibm.com> writes:
|>
|> The network is not HyperTransport. I don't know what it is, but
|> it isn't HT.

I wouldn't bet on it. I agree that it certainly isn't HyperTransport,
as per the specification on www.hypertransport.org, but one of their
meetings a year or two back had Cray talking about HyperTransport over
optical links, and I assumed then that such work was intended for Red
Storm (which, then, did not have that name).

My guess is that, if AMD pulls things off (including HyperTransport),
HyperTransport 2 will probably include AMD's SMP support and will
definitely include support for medium-distance interconnects (such
as this one). So this might well be a HyperTransport 2 prototype.


Regards,
Nick Maclaren.

Andy Isaacson

Oct 29, 2003, 3:58:08 PM
In article <bnolk7$aho$1...@pegasus.csx.cam.ac.uk>,

I'm shooting in the dark here, but in theory you could implement a
*router* that has HT *links*, but doesn't actually do "HyperTransport"
over the entire machine. The router would talk HT to the Opteron,
and would talk HT to the next router in the network, but would implement
its own message-passing protocol over the "network" links rather than
implementing the HT SMP protocol.

I'm pretty sure that Red Storm doesn't do this, though, based on the
Hot Chips presentation. Slide 11 talks about the system chip, and says

- IBM 0.13u ASIC process
- 500 MHz embedded PowerPC(TM)
- 16 bit 1.6 Gbit/sec HyperTransport(TM)
- Six 12 channel 3.2 Gbit/sec High Speed Serial links
- GDA Technologies Inc. HyperTransport Cave

and slide 14 says

- 6 high speed network links per ASIC
- More than 4 Gbyte/sec per link
- Reliable link protocol with CRC-16 and automatic retry
- Support for up to 32k nodes in 3D toroidal mesh

That implies a custom (or at least not-widely-used) serial protocol
between the routers.

There's also some evidence that the link is electrical, not optical,
but it's not explicitly called out either way. Slide 8 labels the
card connector as GbX, which is apparently a COTS product:
http://www.teradyne.com/prods/tcs/products/connectors/backplane/gbx/

-andy (my real domain is hexapodia.org.)

Peter Boyle

Oct 29, 2003, 7:06:16 PM

On Wed, 29 Oct 2003, Andy Isaacson wrote:

>
> I'm shooting in the dark here, but in theory you could implement a
> *router* that has HT *links*, but doesn't actually do "HyperTransport"
> over the entire machine. The router would talk HT to the Opteron,
> and would talk HT to the next router in the network, but would implement
> its own message-passing protocol over the "network" links rather than
> implementing the HT SMP protocol.

I'd guess the HSSL's are LVDS electrically. The edge connectors in the PDF
diagram of the board look suspiciously familiar to me - also I'd hazard a
guess that they're the Tyco HM-ZD high density metric high speed
differential connectors. Not sure though.

Now the interesting question is whether you could do almost the same with a
top end Virtex-II Pro (with 2/3 of the bandwidth - 24x3.2Gbit transceivers,
four embedded PPC 405s, HT cores available).

If only they weren't so expensive....

Peter


> I'm pretty sure that Red Storm doesn't do this, though, based on the
> Hot Chips presentation. Slide 11 talks about the system chip, and says
>
> - IBM 0.13u ASIC process
> - 500 MHz embedded PowerPCTM
> - 16 bit 1.6 Gbit/sec HyperTransportTM
> - Six 12 channel 3.2 Gbit/sec High Speed Serial links
> - GDA Technologies Inc. HyperTransport Cave
>
> and slide 14 says
>
> - 6 high speed network links per ASIC
> - More than 4 Gbyte/sec per link
> - Reliable link protocol with CRC-16 and automatic retry
> - Support for up to 32k nodes in 3D toroidal mesh
>
> That implies a custom (or at least not-widely-used) serial protocol
> between the routers.
>
> There's also some evidence that the link is electrical, not optical,
> but it's not explicitly called out either way. Slide 8 labels the
> card connector as GbX, which is apparently a COTS product:
> http://www.teradyne.com/prods/tcs/products/connectors/backplane/gbx/
>
> -andy (my real domain is hexapodia.org.)
>

Peter Boyle pbo...@physics.gla.ac.uk


Nick Maclaren

Oct 29, 2003, 7:21:39 PM
In article <3f9fe390$0$41297$a186...@newsreader.visi.com>,

Andy Isaacson <a...@pirx.lan> wrote:
>In article <bnolk7$aho$1...@pegasus.csx.cam.ac.uk>,
>Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>>In article <bnogq6$ube$1...@news.rchland.ibm.com>,
>>"Del Cecchi" <cec...@us.ibm.com> writes:
>>|> The network is not HyperTransport. I don't know what it is, but
>>|> it isn't HT.
>>
>>I wouldn't bet on it. I agree that it certainly isn't HyperTransport,
>>as per the specification on www.hypertransport.org, but one of their
>>meetings a year or two back had Cray talking about HyperTransport over
>>optical links, and I assumed then that such work was intended for Red
>>Storm (which, then, did not have that name).
>>
>>My guess is that, if AMD pulls things off (including HyperTransport),
>>HyperTransport 2 will probably include AMD's SMP support and will
>>definitely include support for medium-distance interconnects (such
>>as this one). So this might well be a HyperTransport 2 prototype.
>
>I'm shooting in the dark here, but in theory you could implement a
>*router* that has HT *links*, but doesn't actually do "HyperTransport"
>over the entire machine. The router would talk HT to the Opteron,
>and would talk HT to the next router in the network, but would implement
>its own message-passing protocol over the "network" links rather than
>implementing the HT SMP protocol.
>
>I'm pretty sure that Red Storm doesn't do this, though, based on the
>Hot Chips presentation. Slide 11 talks about the system chip, and says
>
>That implies a custom (or at least not-widely-used) serial protocol
>between the routers.

Ah! Another confusion because we are talking about different levels
of protocol! Yes, at the bottom level, I agree. What I think may
well be the case is that the design is intended to be compatible
with or similar to HyperTransport at the higher levels, and may be
the basis for an extension in the next revision of the protocol.

And, of course, I never meant that this was the same as Coherent
HyperTransport - I would expect the medium range and SMP extensions
to be optional and probably incompatible, in the sense that an
interface would be one of:

Base HyperTransport, in-box, aimed at peripherals
Coherent HyperTransport, in-box, aimed at CPUs
Optical HyperTransport, in-room, aimed at interconnects

With all three inter-operating, in some suitable sense.

But that is based on what I would do, coupled with the vague hints
that have been given to us peasants.


Regards,
Nick Maclaren.

Robert Myers

Oct 30, 2003, 5:10:59 AM

That is one hot chip. Even allowing for all the SERDES to be done by
custom circuitry, there is barely the bandwidth in the 500MHz PPC to
touch the bits even once (naive calculation: 0.5 GHz * 64-bit width
against 6*3.2Gbit/sec bit streams), so whatever "message
preparation" it does must not include calculating a checksum, nor,
in all likelihood, even touching the body of the message. Maybe
that's just garden variety router design.
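As a rough illustration of that back-of-the-envelope argument, here is the
arithmetic spelled out (reading each of the six links as 12 lanes at 3.2
Gbit/s is an assumption taken from the Hot Chips slide quoted earlier in the
thread, not a Cray statement):

    # Can the 500 MHz embedded PPC touch every bit?  Almost certainly not.
    ppc_hz = 500e6                   # embedded PowerPC clock
    word_bits = 64                   # optimistic: one 64-bit word per cycle
    cpu_gbps = ppc_hz * word_bits / 1e9            # ~32 Gbit/s at best

    links, lanes, lane_gbps = 6, 12, 3.2           # assumed link geometry
    net_gbps = links * lanes * lane_gbps           # ~230 Gbit/s per direction

    print(f"CPU can stream at best ~{cpu_gbps:.0f} Gbit/s")
    print(f"links deliver ~{net_gbps:.0f} Gbit/s per direction")
    print(f"shortfall: ~{net_gbps / cpu_gbps:.0f}x")   # the PPC cannot touch the data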

>There's also some evidence that the link is electrical, not optical,
>but it's not explicitly called out either way.

The labelling on the six network channels is interesting (X+,
X-,Y+,Y-, Z+,Z-), suggesting the system is routing by hops, one
processing node per hop. A little hard to reconcile that with the
2usec nearest neighbor, 5usec worst case. Also hard to reconcile with
the code name "Black Widow", the black widow being a spider notorious
for an irregular web.

>Slide 8 labels the
>card connector as GbX, which is apparently a COTS product:
>http://www.teradyne.com/prods/tcs/products/connectors/backplane/gbx/
>
>-andy (my real domain is hexapodia.org.)

The GbX connector sure seems to imply an electrical connection, but I
stumbled across

http://www.htc.honeywell.com/photonics/doc's/santafe98a.pdf

an optical link project in which Cray was involved that would have the
right (projected) properties.

Somehow hard for me to imagine a 3GHz electrical signal going
backplane to pin and back however many times to get to the most
distant node, but what would I know?

RM

Nick Maclaren

Oct 30, 2003, 9:30:35 AM
In article <vb21qvkr7gfsa731t...@4ax.com>,

Robert Myers <rmy...@rustuck.com> wrote:
>>
>That is one hot chip. Even allowing for all the SERDES to be done by
>custom circuitry, there is barely the bandwidth in the 500MHz PPC to
>touch the bits even once (naive calculation: 0.5 Gbit/sec*64 bit width
>processing 6*3.2Gbit/sec bit streams), so whatever "message
>preparation" it does must not include even calculating a checksum, nor
>even in all likelihood even touching the body of the message. Maybe
>that's just garden variety router design.

It isn't that hard (for the CPU) with a decent protocol. Checksums
can be handled entirely in the ASIC part, and can be designed to be
highly parallel. And, with a decent protocol, why SHOULD it touch
the body of the message? All it has to do is to connect the input
to the output and tell them to do it.

Doubtless there is a lot more to the detail, but CONCEPTUALLY that is
what can be done. If the protocol and implementation are well
designed, the PPC CPU should be quite lightly loaded.

>The labelling on the six network channels is interesting (X+,
>X-,Y+,Y-, Z+,Z-), suggesting the system is routing by hops, one
>processing node per hop. A little hard to reconcile that with the
>2usec nearest neighbor, 5usec worst case. Also hard to reconcile with
>the code name "Black Widow", the black widow being a spider notorious
>for an irregular web.

It is a bit surprising for a 16x24x27 mesh, I agree, but I don't see
that it is impossible. Let us assume a protocol where a packet starts
with a path description and is fully streamed. That is a latency of
74 nanoseconds per hop - fast but by no means unreasonable.
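A minimal worked version of that per-hop budget, assuming dimension-ordered
routing with one router hop per node traversed (the hop count is an estimate,
not a Cray figure):

    # Per-hop latency budget for the quoted 27 x 16 x 24 mesh.
    dims = (27, 16, 24)
    worst_case_hops = sum(d - 1 for d in dims)     # 64 hops corner to corner
    worst_case_ns = 5000                           # 5 us quoted worst case

    print(worst_case_ns / worst_case_hops)         # ~78 ns per hop, in line
                                                   # with the ~74 ns figure above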

HOWEVER, if they have done that sort of thing, then it is likely to
have serious degradation characteristics under heavy, conflicting
load. I just happen to have a suitable MPI test program that will
stress such use :-)

One partial solution to that is to concentrate on the MPI collectives
for MPI_COMM_WORLD (with perhaps a few extensions) - this is plausible
in the ASCI and most HPC contexts. You then tune the library to not
generate the simultaneous activities that cause trouble. Any user
who uses point-to-point transfers is assumed to be doing only a few
of them, or at a fairly low rate.

An analysis of that approach is why I said that I did not think that
switched networks were a good idea for HPC communication. I reckon
that I could do BETTER on even a simple 2-D torus than with current,
affordable switches on 1,000+ nodes. And, of course, improved 2-D
and 3-D tori are even better.


Regards,
Nick Maclaren.

Robert Myers

Oct 30, 2003, 3:24:39 PM
On 30 Oct 2003 09:30:35 GMT, nm...@cus.cam.ac.uk (Nick Maclaren) wrote:

<snip>


>
>HOWEVER, if they have done that sort of thing, then it is likely to
>have serious degradation characteristics under heavy, conflicting
>load.

I think you're being kind. Unless there's something we're not being
told, it should have been code-named Downtown Manhattan. The
advantage of a switch is that at least the switch knows what's going
on. Easy to do also with an out-of-band control network, but if such
a thing is a part of the design, there's no indication of it.

RM

Andy Isaacson

Oct 30, 2003, 4:16:26 PM
In article <bnqlnr$34u$1...@pegasus.csx.cam.ac.uk>,

Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>In article <vb21qvkr7gfsa731t...@4ax.com>,
>Robert Myers <rmy...@rustuck.com> wrote:
>>That is one hot chip. Even allowing for all the SERDES to be done by
>>custom circuitry, there is barely the bandwidth in the 500MHz PPC to
>>touch the bits even once (naive calculation: 0.5 Gbit/sec*64 bit width
>>processing 6*3.2Gbit/sec bit streams), so whatever "message
>>preparation" it does must not include even calculating a checksum, nor
>>even in all likelihood even touching the body of the message. Maybe
>>that's just garden variety router design.
>
>It isn't that hard (for the CPU) with a decent protocol. Checksums
>can be handled entirely in the ASIC part, and can be designed to be
>highly parallel. And, with a decent protocol, why SHOULD it touch
>the body of the message? All it has to do is to connect the input
>to the output and tell them to do it.
>
>Doubtless there is a lot more to the detail, but CONCEPTUALLY that
>what can be done. If the protocol and implementation are is well
>designed, the PPC CPU should be quite lightly loaded.

Precisely, there's absolutely no reason for the PPC to touch the data.
In fact, doing so would make it utterly impossible for the router to
even come close to the latency figures noted.

>>The labelling on the six network channels is interesting (X+,
>>X-,Y+,Y-, Z+,Z-), suggesting the system is routing by hops, one
>>processing node per hop. A little hard to reconcile that with the
>>2usec nearest neighbor, 5usec worst case. Also hard to reconcile with
>>the code name "Black Widow", the black widow being a spider notorious
>>for an irregular web.

You're confusing your product code names; Black Widow is in the SV2
(now X1) series, according to
http://www.cray.com/news/0310/q3_2003_results.html

>It is a bit surprising for a 16x24x27 mesh, I agree, but I don't see
>that it is impossible. Let us assume a protocol where a packet starts
>with a path description and is fully streamed. That is a latency of
>74 nanoseconds per hop - fast but by no means unreasonable.

Assuming source routing, as well, there's no reason this design wouldn't
perform like a linear scaling of Myrinet's performance.

I don't see why Robert finds it hard to reconcile the performance
figures with the design. This is pretty similar to the T3E design, and
that one certainly had excellent latency figures.

I'd also be curious to hear what your opinion of the T3E is, Nick.
Does it perform well enough on your MPI stress tests?

-andy

Stephen Fuld

Oct 30, 2003, 5:28:13 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:vb21qvkr7gfsa731t...@4ax.com...

snip

> That is one hot chip. Even allowing for all the SERDES to be done by
> custom circuitry, there is barely the bandwidth in the 500MHz PPC to
> touch the bits even once (naive calculation: 0.5 Gbit/sec*64 bit width
> processing 6*3.2Gbit/sec bit streams), so whatever "message
> preparation" it does must not include even calculating a checksum, nor
> even in all likelihood even touching the body of the message. Maybe
> that's just garden variety router design.

The idea that a processor should be used to calculate a checksum is one born
out of the peculiar blindness of some people with a network background who
can't get beyond thinking that TCP/IP is "the" protocol. In most cases, a
reasonably designed protocol will allow the checksum to be
calculated/checked in a modest piece of dedicated hardware as the data flows
by on its way into or out of the chip.
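As a sketch of what such a dedicated unit does, here is a software model of a
CRC-16 computed in one streaming pass as the bytes go by; the CCITT
polynomial is an arbitrary choice for illustration, since the Hot Chips
slides say only "CRC-16":

    # Streaming CRC-16 model: one pass over the data as it flows through,
    # no per-packet work for any general-purpose CPU. Polynomial 0x1021
    # (CRC-16/CCITT) is assumed here purely for illustration.
    def crc16(data: bytes, crc: int = 0xFFFF) -> int:
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
                crc &= 0xFFFF
        return crc

    payload = b"example packet payload"
    sent_crc = crc16(payload)           # appended by the sending link interface
    received = payload                  # pretend this just crossed the link
    if crc16(received) != sent_crc:     # a mismatch would trigger the automatic retry
        raise IOError("link CRC mismatch")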

--
- Stephen Fuld
e-mail address disguised to prevent spam


Robert Myers

Oct 30, 2003, 5:33:12 PM
On 30 Oct 2003 16:16:26 GMT, a...@pirx.lan (Andy Isaacson) wrote:

<snip>


>
>I don't see why Robert finds it hard to reconcile the performance
>figures with the design. This is pretty similar to the T3E design, and
>that one certainly had excellent latency figures.
>

If I posted "How do I understand the rationale behind this design,"
I'd get no answer. If I say something makes no sense to me, somebody
says, geez, what's different about this from the T3E. Now I can go
look at the history of the T3E and get the answer to my question. ;-).

RM

Del Cecchi

Oct 30, 2003, 6:30:18 PM

"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:NSbob.199957$0v4.15...@bgtnsc04-news.ops.worldnet.att.net...
>
snip

>
> The idea that a processor should be used to calculate a checksum is one
> born out of the peculiar blindness of some people with a network background
> who can't get beyond thinking that TCP/IP is "the" protocol. In most cases,
> a reasonably designed protocol will allow the checksum to be
> calculated/checked in a modest piece of dedicated hardware as the data
> flows by on its way into or out of the chip.
>
> --
> - Stephen Fuld
> e-mail address disguised to prevent spam
>
You mean people don't check the packet length and crc with the hardware as
it comes in? I'm stunned.

del cecchi
>


Del Cecchi

Oct 30, 2003, 6:25:49 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:vb21qvkr7gfsa731t...@4ax.com...
> On 29 Oct 2003 15:58:08 GMT, a...@pirx.lan (Andy Isaacson) wrote:
>
snip (watch attributions)

> >I'm pretty sure that Red Storm doesn't do this, though, based on the
> >Hot Chips presentation. Slide 11 talks about the system chip, and says
> >
> > - IBM 0.13u ASIC process
> > - 500 MHz embedded PowerPCTM
> > - 16 bit 1.6 Gbit/sec HyperTransportTM
> > - Six 12 channel 3.2 Gbit/sec High Speed Serial links
> > - GDA Technologies Inc. HyperTransport Cave
> >
> >and slide 14 says
> >
> > - 6 high speed network links per ASIC
> > - More than 4 Gbyte/sec per link
> > - Reliable link protocol with CRC-16 and automatic retry
> > - Support for up to 32k nodes in 3D toroidal mesh
> >
> >That implies a custom (or at least not-widely-used) serial protocol
> >between the routers.
> >
> That is one hot chip. Even allowing for all the SERDES to be done by
> custom circuitry, there is barely the bandwidth in the 500MHz PPC to
> touch the bits even once (naive calculation: 0.5 Gbit/sec*64 bit width
> processing 6*3.2Gbit/sec bit streams), so whatever "message
> preparation" it does must not include even calculating a checksum, nor
> even in all likelihood even touching the body of the message. Maybe
> that's just garden variety router design.

The 6 links are each 12 bits wide, at 3.2Gbits/second/differential pair.
That is 12 in and 12 out per link.
I think the PPC must just be there to manage the tables and watch the errors
and stuff.

>
> >There's also some evidence that the link is electrical, not optical,
> >but it's not explicitly called out either way.
>
> The labelling on the six network channels is interesting (X+,
> X-,Y+,Y-, Z+,Z-), suggesting the system is routing by hops, one
> processing node per hop. A little hard to reconcile that with the
> 2usec nearest neighbor, 5usec worst case. Also hard to reconcile with
> the code name "Black Widow", the black widow being a spider notorious
> for an irregular web.

If one is a node in a 3d mesh, wouldn't one have
up/down/front/back/left/right links? And didn't we just have a discussion
of how long the links are in a mesh, depending on whether it wraps around?

Source routing with a wormhole says the packets would go through the switch
fabric in tens of nanoseconds, not that I know that this is what Cray does.
So it takes a microsecond or two to figure out the route to some bizarre
node on the other side of the machine, and then maybe 50 ns/hop to wormhole
route it over there. 30 hops in each dimension is 90 hops, or 90*50 = 4500 ns
to get there, for 30 by 30 by 30 nodes.
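Here is a sketch of that kind of source-computed, dimension-ordered route,
with the setup and per-hop costs taken straight from the guesses in this post
rather than from anything Cray has published:

    # Dimension-ordered source routing on a 3D mesh: the route is a list of
    # X+/X-/Y+/Y-/Z+/Z- hops computed once at the source. Costs are guesses.
    def route(src, dst):
        hops = []
        for axis, (s, d) in zip("XYZ", zip(src, dst)):
            step = "+" if d > s else "-"
            hops.extend(axis + step for _ in range(abs(d - s)))
        return hops

    path = route((0, 0, 0), (26, 15, 23))    # far corner of a 27x16x24 mesh
    setup_ns, hop_ns = 1500, 50              # route computation, wormhole hop
    print(len(path), "hops,", setup_ns + hop_ns * len(path), "ns")
    # 64 hops -> ~4700 ns, the same ballpark as the 4500 ns estimate above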

>
> >Slide 8 labels the
> >card connector as GbX, which is apparently a COTS product:
> >http://www.teradyne.com/prods/tcs/products/connectors/backplane/gbx/
> >
> >-andy (my real domain is hexapodia.org.)
>
> The GbX connector sure seems to imply an electrical connection, but I
> stumbled across
>
> http://www.htc.honeywell.com/photonics/doc's/santafe98a.pdf
>
> an optical link project in which Cray was involved that would have the
> right (projected) properties.
>
> Somehow hard for me to imagine a 3GHz electrical signal going
> backplane to pin and back however many times to get to the most
> distant node, but what would I know?
>
> RM

It gets retimed at every node it passes through.

>


Zak

Oct 30, 2003, 7:46:36 PM
Robert Myers wrote:

> The labelling on the six network channels is interesting (X+,
> X-,Y+,Y-, Z+,Z-), suggesting the system is routing by hops, one
> processing node per hop. A little hard to reconcile that with the
> 2usec nearest neighbor, 5usec worst case. Also hard to reconcile with
> the code name "Black Widow", the black widow being a spider notorious
> for an irregular web.

Well... I read this and I had a vision of not jumping one, but multiple
hops - and unequal for + and - directions. + would go one up, - would go
sqrt(height) down, or so.

I don't know if it would help a lot, and it probably won't make sense to
make the connections oblique (Z doing some hops just in the X or Y direction,
for example) - there are 3 independent axes anyway.

This would fit the above description nicely...


Thomas

Terje Mathisen

Oct 30, 2003, 8:04:49 PM
Stephen Fuld wrote:

I used to be totally in this camp, but there is one very good reason for
having a cpu-to-cpu checksum on any kind of transfer:

It will catch those _very_ rare but also _very_ destructive cases where
things that "cannot happen" still do, i.e. an Ethernet card that
silently drops bytes, while confirming the CRC checksum, but only if at
least one transmitter is breaking the Ethernet spec.

I've seen this happen with at least two different "server-level" network
cards, working inside boxes with some form of ECC on all buses.

I.e. all i/o was protected at all levels, but we still got silent data
corruption about once a day or so, when replicating GB's of data to
multiple servers.

Today I would have used rsync on top of ssh instead.

Terje
--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Stephen Sprunk

Oct 30, 2003, 9:26:38 PM
"Robert Myers" <rmy...@rustuck.com> wrote in message
news:vb21qvkr7gfsa731t...@4ax.com...
> That is one hot chip. Even allowing for all the SERDES to be done by
> custom circuitry, there is barely the bandwidth in the 500MHz PPC to
> touch the bits even once (naive calculation: 0.5 Gbit/sec*64 bit width
> processing 6*3.2Gbit/sec bit streams), so whatever "message
> preparation" it does must not include even calculating a checksum, nor
> even in all likelihood even touching the body of the message. Maybe
> that's just garden variety router design.

Well, I hate to call something this fast "garden variety", but the general
idea is common enough. The PPC's sw is just there for observing the
network, building forwarding tables, etc., which is now called the "control
plane" of the network. The router chip may do some sort of checksum for
error detection on each link, but that would be done at wire rate inside the
ASIC itself; cpu-to-cpu traffic should never hit the PPC. This purely hw
forwarding system is now called the "data plane".

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking


Thomas Gooding

Oct 30, 2003, 9:46:23 PM
Terje Mathisen wrote:

> Stephen Fuld wrote:
>> The idea that a processor should be used to calculate a checksum is
>> one born
>> out of the peculiar blindness of some people with a network background
>> who
>> can't get beyond thinking that TCP/IP is "the" protocol. In most
>> cases, a
>> reasonably designed protocol will allow the checksum to be
>> calculated/checked in a modest piece of dedicated hardware as the data
>> flows
>> by on its way into or out of the chip.
>
>
> I used to be totally in this camp, but there is one very good reason for
> having a cpu-to-cpu checksum on any kind of transfer:
>
> It will catch those _very_ rare but also _very_ destructive cases where
> things that "cannot happen" still do, i.e. an Ethernet card that
> silently drops bytes, while confirming the CRC checksum, but only if at
> least one transmitter is breaking the Ethernet spec.

It also can catch those buggy network drivers. If the driver mishandles
a ring buffer queue, miscomputes the packet size, or otherwise corrupts
the data - the software chksum calculation may catch the defect and junk
the packet.

Not that any of that would ever happen in the field....

Tom Gooding
tgooding at us.ibm.com
IBM Rochester (my opinions only)

George William Herbert

Oct 30, 2003, 10:29:59 PM
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>I used to be totally in this camp, but there is one very good reason for
>having a cpu-to-cpu checksum on any kind of transfer:
>
>It will catch those _very_ rare but also _very_ destructive cases where
>things that "cannot happen" still do, i.e. an Ethernet card that
>silently drops bytes, while confirming the CRC checksum, but only if at
>least one transmitter is breaking the Ethernet spec.

The problem with this...

You have to trust something, unless you're running something
like a Stratus or Tandem box with lockstepped CPUs. Normally
that is 'I trust the CPU'. In a SMP box, that is 'I trust the CPUs'.

You have to trust certain aspects of the RAM and memory controllers.
Even if you checksum and ECC everything, when you ask for a particular
cache line, you assume that you not only are going to get valid data
(ECC confirmed) but that it's the data you asked for. Even with ECC
on all the address lines, you're assuming that the data that comes
in was for the address you specified, and that there wasn't an internal
fault within RAM that led to it putting the wrong data (and the wrong
data's correct ECC) on the bus back.

You have to trust that I/O devices that DMA won't do grossly wrong things.

You have to assume certain things about disk that are similar to
what you have to assume about RAM, though with software RAID you
can distrust here somewhat more. You could make RAM redundant and/or
RAID it but I don't know of anyone other than Stratus/Tandem that do.

Everyone remembers the network controllers that did The Wrong Thing
from time to time, but similar things have happened before with
RAM, Disk, I/O devices that DMA, etc. We have not responded to
all those problems with additional layers of protection to catch
those corner cases. We have generally concentrated on making the
failures low enough probability that we don't care. We don't put
a disk block label and disk block ECC inside data we put out to
disk, and perform CPU validation of all disk reads to make sure
that the data was the correct stuff and is valid. We don't do
that with RAM. We often don't protect memory really stringently
from insane DMAs.

It is not cost effective or sensible to make overall system
architectures imbalanced on the reliability question.
If there were some immutable reason that unreliability was
inherently present in some component, then that would be
justification. But what we find is, generally, that all
the stuff works ok. ECC in the RAM catches effectively
all of the RAM problems; disks internal ECC catches most of
their problems; most ASICs don't do Really Bad Things inside
the system. And that we sometimes get corrupted bus traffic,
so that ECCing that makes sense. But for the most part,
the odds of network chip corruption of data are not much
worse than of RAM corruption of data or disk corruption
of data or other bad things in the system.

When we choose to inflict a significant CPU load in order to
elevate one of those relatively equally unlikely corner cases
above the rest, that's probably a bad overall architecture choice.

This is an internal version of the tradeoff that has to be
made in system design, and then in operating environment
and staffing, about reliability and cost. Using mirrored
RAID storage in a system without ECC in the memory path
is often silly; you have ignored a major reliability
problem in one area and spent a lot on another, when
a more balanced approach is better for overall reliability.
It's similar to the problem of spending serious bucks on
getting redundant HA servers and redundant RAID arrays
for the data in a SAN, and then not buying a good enough
UPS system for when the power goes out. And so on...


-george william herbert
gher...@retro.com

Greg Lindahl

Oct 30, 2003, 10:55:30 PM
In article <bns3d7$o29$1...@gw.retro.com>,

George William Herbert <gher...@gw.retro.com> wrote:

>You have to trust something,

It's still good practice to add some additional layers of
error-checking. It helps you diagnose bugs faster, and it helps you
figure out what the error rates actually are.

You will find that several MPI libraries (message passing API used in
HPC) have an extra software checksum that can optionally be turned
on. Why do you think that option is there? I assure you it isn't
usually added until after a generation or two...

In the enterprise space, customer tolerance for bugs is low. If you're
going to use crappy, crappy busses like PCI, it pays to be able to
make sure it actually works.

Now whether or not all these additional layers get turned on all the
time is another issue. That seems to be what you've addressed in your
posting. That depends on how costly the extra checks are, and how
much the customer cares about performance.

Good reading:

When The CRC and TCP Checksum Disagree
Jonathan Stone, Craig Partridge
http://sigcomm.it.uu.se/conf/paper/sigcomm2000-9-1.ps.gz

Abstract: Traces of Internet packets from the past two years show that
between 1 packet in 1,100 and 1 packet in 32,000 fails the TCP
checksum, even on links where link-level CRCs should catch all but 1
in 4 billion errors. For certain situations, the rate of checksum
failures can be even higher: in one hour-long test we observed a
checksum failure of 1 packet in 400. We investigate why so many errors
are observed, when link-level CRCs should catch nearly all of them.

What do you think the right response to the situation outlined in
this paper is?

-- greg

Bill Todd

Oct 30, 2003, 11:49:30 PM
While there is a lot of merit in what you say, it's not the *only* valid
view.

"George William Herbert" <gher...@gw.retro.com> wrote in message
news:bns3d7$o29$1...@gw.retro.com...


> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> >I used to be totally in this camp, but there is one very good reason for
> >having a cpu-to-cpu checksum on any kind of transfer:
> >
> >It will catch those _very_ rare but also _very_ destructive cases where
> >things that "cannot happen" still do, i.e. an Ethernet card that
> >silently drops bytes, while confirming the CRC checksum, but only if at
> >least one transmitter is breaking the Ethernet spec.
>
> The problem with this...
>
> You have to trust something, unless you're running something
> like a Stratus or Tandem box with lockstepped CPUs.

Exactly. And there's a *reason* that those systems exist: you *can't*
trust the hardware completely, or even to the degree that at least some
customers require.

As a result, it is worth at least considering just what steps you can take
to make the *rest* of the system as reliable as the Tandem- or Stratus-style
portion.

> Normally
> that is 'I trust the CPU'. In a SMP box, that is 'I trust the CPUs'.
>
> You have to trust certain aspects of the RAM and memory controllers.
> Even if you checksum and ECC everything, when you ask for a particular
> cache line, you assume that you not only are going to get valid data
> (ECC confirmed) but that it's the data you asked for. Even with ECC
> on all the address lines, you're assuming that the data that comes
> in was for the address you specified, and that there wasn't an internal
> fault within RAM that led to it putting the wrong data (and the wrong
> data's correct ECC) on the bus back.

These things Tandem and Stratus address, I think. But those approaches
aren't the *only* way to improve reliability in these areas: for example,
defensive programming in both the OS and the applications it runs can
certainly catch some significant percentage of such errors before they
propagate irreversibly - it may not be as bomb-proof as the Tandem/Stratus
hardware-based mechanisms, but it's a hell of a lot better than nothing.

>
> You have to trust that I/O devices that DMA won't do grossly wrong things.

Again, defensive programming can at least help here (especially in areas
where the Tandem/Stratus approaches won't).

>
> You have to assume certain things about disk that are similar to
> what you have to assume about RAM,

No, you don't, and not all systems do. More on that as we go along below.

> though with software RAID you
> can distrust here somewhat more. You could make RAM redundant and/or
> RAID it but I don't know of anyone other than Stratus/Tandem that do.

The new (EV7) Alpha systems support parity-RAID memory, and my vague
recollection is that some IBM systems do as well. Now, whether they *check*
the consistency of the parity on every access, or just use it to recover
data that the ECC check fails on, I don't know.

>
> Everyone remembers the network controllers that did The Wrong Thing
> from time to time, but similar things have happened before with
> RAM, Disk, I/O devices that DMA, etc. We have not responded to
> all those problems with additional layers of protection to catch
> those corner cases.

At least in the case of disks, we at least sometimes have. Several vendors
of mid-range to high-end storage offer what they call 'end to end'
checksums, wherein at some point in the system a checksum is generated which
accompanies the data to disk and back again and is checked on each Read
operation. IIRC IBM's i-series platforms embed this checksum (which I think
also contains disk address information to help catch 'wild' disk writes) in
each disk sector (expanded by 8 - ? - bytes to make room for it; I don't
know where the checksum is calculated and verified, but if it's in main
memory then it truly protects the data all the way to disk and back again).
ISTR that some Unisys system did the same kind of thing, and Sun does
something similar in at least one of its storage systems, as I believe does
EMC - but these last two protect only from some point outside main memory to
disk and back again.
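A minimal sketch of that kind of per-sector scheme, with the block address
folded into the checksum so a 'wild' write to the wrong sector is caught on
the next read (the layout and names here are hypothetical, not any vendor's
actual format):

    import zlib

    SECTOR = 512

    def seal(lba: int, data: bytes) -> bytes:
        """Append (lba, crc) so misdirected or corrupted writes show up on read."""
        tag = lba.to_bytes(8, "little")
        crc = zlib.crc32(tag + data).to_bytes(4, "little")
        return data + tag + crc                      # 512 + 8 + 4 bytes on disk

    def check(lba: int, sealed: bytes) -> bytes:
        data, tag, crc = sealed[:SECTOR], sealed[SECTOR:SECTOR+8], sealed[SECTOR+8:]
        if int.from_bytes(tag, "little") != lba:
            raise IOError("wild write: sector carries the wrong block address")
        if zlib.crc32(tag + data).to_bytes(4, "little") != crc:
            raise IOError("data corrupted between memory and disk")
        return data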

> We have generally concentrated on making the
> failures low enough probability that we don't care. We don't put
> a disk block label and disk block ECC inside data we put out to
> disk, and perform CPU validation of all disk reads to make sure
> that the data was the correct stuff and is valid.

See above: that's exactly what some systems do, albeit not necessarily in
main memory. And that's just for 'unstructured' data: it's *extremely*
common for facilities like file systems and databases to embed their own
checksums in their metadata (or structured user-data containers) - because
of the catastrophic consequences of undetected errors there.

> We don't do
> that with RAM. We often don't protect memory really stringently
> from insane DMAs.

While I don't know of any system that does this, it is certainly possible
(and in some cases without additional disk access activity) to validate all
data from main-memory to disk and back again such that *any* disk error
(including both wild writes and writes that fail silently) can be caught.
Since this would also help catch a 'wild' DMA (i.e., one that deposited its
data in the wrong location - though if a retry caused the data to be
deposited in the correct location the only option would be to crash the
system, assuming that the first attempt had trashed something), it would at
least partially address that issue (for disk DMA, anyway).

>
> It is not cost effective or sensible to be making system
> overall architectures imbalanced in the reliability question.
> If there is some immutable reason that unreliability were
> inherently present in some component, then that would be
> justification. But what we find is, generally, that all
> the stuff works ok.

Only for a definition of 'OK' that balances cost against risk. And with
constantly decreasing hardware costs, that balance is constantly shifting.

When disks were ridiculously expensive, it was 'OK' to do without RAID
(especially costly mirroring) - restoring from backups was an acceptable,
though painful, option. That's no longer true today, for most people: the
cost of redundant storage is preferable to the pain of performing a restore
operation - *even though backups are still performed to protect against
other failure modes*.

...

> When we chose to inflict a significant CPU load in order to
> elevate one of those relatively equally unlikely corner cases
> above the rest, that's probably a bad overall architecture choice.

Possibly. But sometimes there's no significant additional CPU load at all.

Consider, for example, any case where data must be moved in memory by the
CPU prior to being sent to disk. When this happens, generating a CRC may
incur negligible additional overhead (sometimes even encryption can be
thrown in as well nearly for free).
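For instance, a checksum can ride along with a copy the CPU was going to make
anyway, as in this sketch (zlib.crc32 is just a stand-in for whatever CRC the
storage stack actually uses):

    import zlib

    def copy_with_crc(src: bytes, dst: bytearray) -> int:
        """Copy src into dst, computing a CRC during the same pass over the data."""
        crc = 0
        view = memoryview(src)
        for off in range(0, len(src), 4096):        # chunked, like a buffered copy
            chunk = view[off:off + 4096]
            dst[off:off + len(chunk)] = chunk       # the move we had to do anyway
            crc = zlib.crc32(chunk, crc)            # checksum the data while it's hot
        return crc

    data = b"some block of file data" * 1000
    buf = bytearray(len(data))
    crc = copy_with_crc(data, buf)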

And even when there *is* noticeable additional CPU overhead, if it can be
limited to rare occasions (or occasions where other significant overhead is
occurring anyway) that may be acceptable too. For example, if most small
updates to a portion of a file can be captured as small log records (which
must be CRCed for other reasons, but since they're small it doesn't add too
much overhead) and a large-block CRC must be generated only when said large,
many-times-modified block is written back to disk, the added percentage of
CPU overhead of the CRC may be fairly small (as it may when applied to very
small disk accesses, where generating the CRC pales in comparison to the
other CPU activity required by the operation).

And, of course, CPU power is getting cheaper just as fast as everything
else - so, again, the balance keeps changing.

>
> This is an internal version of the tradeoff that has to be
> made in system design, and then in operating environment
> and staffing, about reliability and cost. Using mirrored
> RAID storage in a system without ECC in the memory path
> is often silly; you have ignored a major reliability
> problem in one area and spent a lot on another, when
> a more balanced approach is better for overall reliability.

Just to be a bit picky here, RAID as usually implemented has nothing to do
with reliability: it simply promotes *availability* (continued access to
your data, without having to restore from backups or recreate from database
logs) without in any way increasing the probabilty that the data you get is
the same as the data you originally wrote (a *reliability* issue, which ECC
*does* help address, as does ECC for on-disk data).

Thus RAID can make complete sense in a system without ECC memory - if your
need for continuous operation is high but the consequences of occasional bad
data can be tolerated.

> It's similar to the problem of spending serious bucks on
> getting redundant HA servers and redundant RAID arrays
> for the data in a SAN, and then not buying a good enough
> UPS system for when the power goes out. And so on...

Except for the degree to which possibly ragged power interruptions can
affect reliability (and this *can* happen: instances where disks have
silently accepted corrupt data from main memory because the bus clamped down
before the disk did have been reported), UPSs, like RAID, are purely an
availability mechanism. But they're not the same *kind* of availability
mechanism: if the people using your system won't be able to continue to
function in a powered-down environment, then keeping your system running may
be no use at all - whereas protecting against interruptions due to disk
failure may still be very important for continuing normal operation when the
power is on.

- bill

George William Herbert

Oct 31, 2003, 5:01:09 AM
Greg Lindahl <lin...@pbm.com> wrote:
>George William Herbert <gher...@gw.retro.com> wrote:
>
>>You have to trust something,
>
>It's still good practice to add some additional layers of
>error-checking. It helps you diagnose bugs faster, and it helps you
>figure out what the error rates actually are.

It appears that I may not have been as clear as I intended...

I have no problems with having the ability to do additional
checksumming.

I objected specifically to forcing the CPU to do it rather
than assuming the network interface ASIC can do so properly.

In the various situations you quoted (and I snipped for brevity),
having the ability for the CPU to double- or triple-check depending
on circumstances is beneficial, but should not be the preferred
method. The preferred method should be that any checking
you think needs to be done on a routine basis, should be done
in dedicated hardware. Being able to crosscheck at the CPU level
after a software switch is flipped is a debugging advantage
(we all know about single case and bad batch problems in hardware)
and should be preserved, but not routine, unless system reliability
justifies the cpu capacity 'expense'.


-george william herbert
gher...@retro.com

Terje Mathisen

Oct 31, 2003, 7:29:08 AM
Del Cecchi wrote:
> You mean people don't check the packet length and crc with the hardware as
> it comes in? I'm stunned.

Read my other post about this:

It is possible (by existence proof) to have a CRC-checked packet
received by a card which then delivers the payload over a checked bus,
and still have errors caused by hw race conditions in the FIFO buffers
in the network card itself.

Terje Mathisen

Oct 31, 2003, 8:00:30 AM
George William Herbert wrote:

> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>
>>I used to be totally in this camp, but there is one very good reason for
>>having a cpu-to-cpu checksum on any kind of transfer:
>>
>>It will catch those _very_ rare but also _very_ destructive cases where
>>things that "cannot happen" still do, i.e. an Ethernet card that
>>silently drops bytes, while confirming the CRC checksum, but only if at
>>least one transmitter is breaking the Ethernet spec.
>
> The problem with this...
>
> You have to trust something, unless you're running something
> like a Stratus or Tandem box with lockstepped CPUs. Normally
> that is 'I trust the CPU'. In a SMP box, that is 'I trust the CPUs'.

[long nice description of how and why the system has to be trusted as
some point]

As I wrote in the first paragraph, I used to agree totally.

However, having been bitten twice, I still don't feel that this is
enough to require sw checksums on everything, just that the capability
of turning it on is _very_ nice to have.

In my last paragraph (which you snipped) I noted that today I do similar
transfers using rsync over ssh, which means that I instead trust the
integrity checking of the compressed+encrypted ssh channel.

BTW, we very recently got into a _very_ strange situation where a couple
of network links would crash (ending up with zero open window space in
both directions) during offsite backup operations, and we eventually
figured out that this was repeatable using any of about 4 different
files (on a given link).

Those same files would crash the link using either SMB file sharing, NFS
mounting or FTP file transfer.

From this description I immediately pointed the finger at the involved
network links, guessing that we might have located a data
pattern-related firmware bug in the io cards, or in the
compression/decompression modules. It turned out that compression was
disabled.

After nearly a month of finger-pointing, the link vendor finally
admitted that they had forgotten to enable clocking on the link ("we
assumed you wanted to do that on your router i/o cards"), so any
sufficiently long stretch of nulls (probably, it might have been some
other pattern) was enough to cause it to lose sync and then reset.

BTW, if the required pattern was long enough (say larger than 8 bytes),
any sufficiently good compression and/or encryption module would have
been enough to effectively mask the error "forever".

Nick Maclaren

Oct 31, 2003, 1:28:15 PM

In article <gta2qv8orod60c95f...@4ax.com>,

Whereas the disadvantage of a switch is that it is a bottleneck,
and the switches are usually the limit on the performance of
large clusters. Quadrics (and, to some extent, Myrinet) "fat
switches" help, but at the cost of increasing the cost.


Regards,
Nick Maclaren.

Nick Maclaren

Oct 31, 2003, 1:37:55 PM

In article <3fa1395a$0$75888$a186...@newsreader.visi.com>,

a...@pirx.lan (Andy Isaacson) writes:
|>
|> I'd also be curious to hear what your opinion of the T3E is, Nick.
|> Does it perform well enough on your MPI stress tests?

I have never used it myself, and developed the tests after it had
ceased to make new sales. But, by all accounts from reliable
people, it is very good and would do well. Essentially, my
tests are just MPI_Alltoall (and MPI_Broadcast, to some extent),
implemented in half a dozen ways and on a large number of sizes,
subset sizes and subset patterns. It doesn't test anything
except the basic MPI 1 group and transfer functions, but gives
those a good hammering.
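For concreteness, a minimal sketch of that style of test in mpi4py (this is
not the program being offered here; process count, message size and iteration
count are arbitrary):

    # All-to-all stress loop: every rank sends a block to every other rank,
    # repeatedly, which is roughly the "good hammering" described above.
    # Run with e.g.:  mpirun -np 64 python alltoall_stress.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    nprocs = comm.Get_size()

    chunk = 1 << 16                                  # doubles sent to each peer
    sendbuf = np.random.rand(nprocs * chunk)
    recvbuf = np.empty_like(sendbuf)

    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(20):
        comm.Alltoall(sendbuf, recvbuf)
    elapsed = MPI.Wtime() - t0

    if comm.Get_rank() == 0:
        moved = 20 * nprocs * (nprocs - 1) * chunk * 8   # bytes over the network
        print(f"aggregate ~{moved / elapsed / 1e9:.2f} GB/s across the fabric")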

If you still have access to one, please ask me for a copy! Not
polished or fully completed in all respects, but usable ....


Regards,
Nick Maclaren.

Nick Maclaren

Oct 31, 2003, 1:53:34 PM

In article <bnt304$qrh$2...@osl016lin.hda.hydro.com>,

Terje Mathisen <terje.m...@hda.hydro.com> writes:
|> Del Cecchi wrote:
|> > You mean people don't check the packet length and crc with the hardware as
|> > it comes in? I'm stunned.
|>
|> Read my other post about this:
|>
|> It is possibly (by existence proof) to have a CRC-checked packet
|> received by a card which then delivers the payload over a checked bus,
|> and still have errors caused by hw race conditions in the FIFO buffers
|> in the network card itself.

Sigh. Yes. And similar errors, such as:

A driver bug causing the device addresses of two transfers
to get swapped, so that each got the other's data.

A race condition causing reblocking code to write to the
wrong block, causing a section of data to be replicated (and the
overwritten data lost).

I originally wrote my checksum program because of the number of
unreliable FTP daemons and clients around, but I found that some
of the failures were in TCP/IP (and not FTP) and have seen them
since with other transfer methods. Curiously, a hard core of them is on
systems with checksums on (and, of course, ECC). They are too
repeatable to be the known TCP checksum limit.

The corruptions I have checked have included ones that had to be
high-level errors (e.g. unflagged failure, bad recovery logic or
race conditions), as well as one that could have been the actual
read/write hardware misbehaving. And I have seen them on systems
with no code ancestry in common.

Don't ask me exactly what is going wrong, but the only thing that
will detect ALL such errors is end-to-end checksumming AND PROPER
HANDSHAKING LOGIC. A lot of DETECTED errors were and are treated
as normal end of file because of the lack of the latter.


Regards,
Nick Maclaren.

Del Cecchi

Oct 31, 2003, 1:51:41 PM

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:bnt304$qrh$2...@osl016lin.hda.hydro.com...

> Del Cecchi wrote:
> > You mean people don't check the packet length and crc with the hardware
as
> > it comes in? I'm stunned.
>
> Read my other post about this:
>
> It is possibly (by existence proof) to have a CRC-checked packet
> received by a card which then delivers the payload over a checked bus,
> and still have errors caused by hw race conditions in the FIFO buffers
> in the network card itself.
>
> Terje
> --
Wouldn't that be considered a broken piece of hardware? I certainly would
consider a network card that could corrupt data internal to the card due to
a timing problem on some sort of asynchronous interface to be broken, either
by design or defect. If the data is good when it comes off the wire and on
to the card, it is up to the card to keep it that way. I guess the fact
that some folks sell broken hardware may be reason enough not to trust
anyone. But it does seem sort of strange that the same networking folks
worrying about broken hardware somehow corrupting their data blithely take
the length of the packet on blind faith and get caught by buffer overruns
time after time after time. Ping of Death and many others. Fool me 17
times, shame on me.

del cecchi

Robert Myers

Oct 31, 2003, 3:21:50 PM
On Fri, 31 Oct 2003 07:51:41 -0600, "Del Cecchi" <cec...@us.ibm.com>
wrote:

>Wouldn't that be considered a broken piece of hardware? I certainly would
>consider a network card that could corrupt data internal to the card due to
>a timing problem on some sort of asynchronous interface to be broken, either
>by design or defect. If the data is good when it comes off the wire and on
>to the card, it is up to the card to keep it that way. I guess the fact
>that some folks sell broken hardware may be reason enough not to trust
>anyone. But it does seem sort of strange that the same networking folks
>worrying about broken hardware somehow corrupting their data blithely take
>the lenght of the packet on blind faith and get caught by buffer overruns
>time after time after time. Ping of Death and many others. Fool me 17
>times, shame on me.
>

Someone asked me yesterday where Excel had come from. I told him
about VisiCalc and how it instantly made personal computers attractive
to anyone who had ever had to prepare a budget.

How many budgets were prepared with pencil and paper, I wonder, where
the row and column sums were inconsistent because someone pencilled in
a last minute change?

People tolerate unreliable hardware because the software is even
worse. People tolerate bad hardware because, well, I think you get
the point.

By now, people are so accustomed to buggy, unreliable, and
incomprehensible hardware and software that I wonder if the situation
is even reparable.

The costs of buffer overruns just couldn't be more clear and
convincing, but people pay them regularly while trying to figure if
they couldn't squeeze a few more nickels out of the IT budget.

Never saw this message from IBM or anyone else. Are people in
marketing even afraid to touch the subject?

RM

Robert Myers

Oct 31, 2003, 4:10:22 PM

Since the action is in the interconnects, why don't people spend more
time talking about them? Or don't people understand that the action
is in the interconnects?

RM

Del Cecchi

Oct 31, 2003, 5:09:26 PM

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:l425qv0jfudguqg8f...@4ax.com...

People do talk about them. At least some people. What do you think
InfiniBand has been doing? They are still talking about issues like
congestion management and pacing and all that stuff.

I would like to think that a non-blocking switch is just that, one which can
carry full traffic on all inputs to all outputs. Of course if 4 inputs are
sending to the same output, that is a congestion point and either the inputs
have to be throttled or they fight it out and wait. But that is
unpreventable.

del cecchi

Robert Myers

Oct 31, 2003, 6:15:32 PM
On Fri, 31 Oct 2003 11:09:26 -0600, "Del Cecchi" <cec...@us.ibm.com>
wrote:

>


>"Robert Myers" <rmy...@rustuck.com> wrote in message
>news:l425qv0jfudguqg8f...@4ax.com...

<snip>


>> >
>> Since the action is in the interconnects, why don't people spend more
>> time talking about them? Or don't people understand that the action
>> is in the interconnects?
>>
>> RM
>
>People do talk about them. At least some people. What do you think
>InfiniBand has been doing? They are still talking about issues like
>congestion management and pacing and all that stuff.
>

Geez, Del. I throw you an easy fastball to swing at and you hit a
line drive right back at the pitcher. :-).

Okay, so I have a little catching up to do. I went to the IBM web
site and searched on Infiniband and got 92 hits.

What scares me about infiniband is that it has already swallowed a lot
of money and mostly killed off small companies.

Intel has its foot in mid-air and hasn't decided where to step or who
to squash.

If 10G ethernet over copper becomes a reality, then either infiniband
changes its marketing and pricing strategy or it becomes a legacy
technology. And, of course, ethernet doesn't mean you have to be a
moron and use TCP/IP.

RM


Nick Maclaren

Oct 31, 2003, 6:31:46 PM
In article <bnu509$k3q$1...@news.rchland.ibm.com>,

Del Cecchi <cec...@us.ibm.com> wrote:
>
>> Since the action is in the interconnects, why don't people spend more
>> time talking about them? Or don't people understand that the action
>> is in the interconnects?
>
>People do talk about them. At least some people. What do you think
>InfiniBand has been doing? They are still talking about issues like
>congestion management and pacing and all that stuff.

People do, indeed, talk about them. At length. My prediction is that
InfiniBand will be talking about such things for the indefinite future.

>I would like to think that a non-blocking switch is just that, one which can
>carry full traffic on all inputs to all outputs. Of course if 4 inputs are
>sending to the same output, that is a congestion point and either the inputs
>have to be throttled or they fight it out and wait. But that is
>unpreventable.

I should like to think that, too, but one cannot always do what one
likes.

I am thoroughly unconvinced of the practicability of a non-blocking
switch for 1,000 ports at the maximum speed of a point-to-point
interconnect of ANY technology. I certainly await evidence that any
such exists.


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Oct 31, 2003, 6:59:34 PM10/31/03
to

"Del Cecchi" <cec...@us.ibm.com> wrote in message
news:bnrlbq$rkk$1...@news.rchland.ibm.com...

AFAIK, virtually all but the highest end network cards do not check the TCP
checksum at all. Of course, TCP makes things a little harder for the
hardware by not putting the checksum at the end of the packet. The newer
high end TCP Offload Engines (TOEs) do check the checksum and perform other
functions, but they are still relatively rare. And a significant reason why
they were developed is the problems many systems were having keeping up with
all the software required to handle TCP when you have multiple 1 Gb Ethernet
cards or future 10 Gb cards on a server.
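
To make the placement point concrete, here is a rough sketch in C of the
classic TCP header layout (RFC 793); the field names are just for
illustration. The checksum sits at byte offset 16, in front of the
payload, so hardware that streams a segment out has to either buffer the
whole thing or make a second pass to patch the checksum in, whereas a
link-level CRC is simply appended after the last data byte:

#include <stdint.h>

/* Classic TCP header: the 16-bit checksum lives at byte offset 16,
 * ahead of the payload it covers (together with a pseudo-header),
 * rather than trailing the data like an Ethernet CRC does. */
struct tcp_header {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;
    uint32_t ack;
    uint16_t offset_and_flags;  /* data offset, reserved bits, flags */
    uint16_t window;
    uint16_t checksum;          /* byte offset 16 */
    uint16_t urgent_ptr;
    /* options (if any), then the payload, follow */
};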

Jonathan Stone

unread,
Oct 31, 2003, 6:57:19 PM10/31/03
to
In article <NSbob.199957$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>The idea that a processor should be used to calculate a checksum is one born
>out of the peculiar blindness of some people with a network background who
>can't get beyond thinking that TCP/IP is "the" protocol. In most cases, a
>reasonably designed protocol will allow the checksum to be
>calculated/checked in a modest piece of dedicated hardware as the data flows
>by on its way into or out of the chip.

Stephen, I'm genuinely sorry if I sound offensive here, but your
message is, at the very least, woefully under-informed about the state
of the art in checksum offload. Your comments also paint a rather
broad brush with what seems rather like contempt for networking
people.


Yes, indeed hardware can compute checksums. But of course such a
checksum covers only the path between that dedicated piece of
hardware, and the similar piece of hardware at the next hop which
verifies the checksum. It does not cover the path to and from main
memory buffers, which is what networking people mean by "end-to-end".

Data corruption errors can and do occur on that path. Outboard
checksum hardware won't find them (modulo double-checksum/double-DMA
tricks like IEN-45).

Never mind the well-known, demonstrated fact that hardware weenies
simply cannot get all the corner cases of TCP checksum offload right,
at least not in the mass-market cards available on today's market.

Stephen Fuld

unread,
Oct 31, 2003, 7:09:42 PM10/31/03
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:bntpgu$qtp$1...@pegasus.csx.cam.ac.uk...

>
> In article <bnt304$qrh$2...@osl016lin.hda.hydro.com>,
> Terje Mathisen <terje.m...@hda.hydro.com> writes:
> |> Del Cecchi wrote:
> |> > You mean people don't check the packet length and crc with the hardware as
> |> > it comes in? I'm stunned.
> |>
> |> Read my other post about this:
> |>
> |> It is possibly (by existence proof) to have a CRC-checked packet
> |> received by a card which then delivers the payload over a checked bus,
> |> and still have errors caused by hw race conditions in the FIFO buffers
> |> in the network card itself.
>
> Sigh. Yes. And similar errors, such as:
>
> A driver bug causing the device addresses of two transfers
> to get swapped, so that each got the other's data.
>
> A race condition causing reblocking code to write to the
> wrong block, causing a section of data to be replicated (and the
> overwritten data lost).

If you want to use an additional checksum as a software debugging feature,
that is fine with me. But then I don't want to hear from anyone (I know you
wouldn't do this, Nick) about the extra overhead of the code for
subscript/pointer checking. I would bet a lot of money that those types of
errors are far more frequent than the ones you have described above.

snip

> Don't ask me exactly what is going wrong, but the only thing that
> will detect ALL such errors is end-to-end checksumming AND PROPER
> HANDSHAKING LOGIC. A lot of DETECTED errors were and are treated
> as normal end of file because of the lack of the latter.

So you do end to end checksumming in software on all disk transfers? I
thought not. Why is that different?

Rick Jones

unread,
Oct 31, 2003, 7:16:00 PM10/31/03
to
Stephen Fuld <s.f...@pleaseremove.att.net> wrote:
> AFAIK, virtually all but the highest end network cards do not check
> the TCP checksum at all. Of course, TCP makes things a little
> harder for the hardware by not putting the checksum at the
> end of the packet. The newer high end TCP Offload Engines (TOEs) do
> check the checksum and perform other functions, but they are still
> relatively rare. And a significant reason why they were developed is
> the problems many systems were having keeping up with all the
> software required to handle TCP when you have multiple 1 Gb Ethernet
> cards or future 10 Gb cards on a server.

Checksum Offload (aka CKO) goes back to the early-mid 1990's (perhaps
farther) when it was incorporated into a number of FDDI interfaces
from the likes of HP, SGI and DEC. (and I'm sure others) I know (from
being told, not from participating :) that the internal work that went
into the HP HP-PB FDDI NIC with CKO actually started with a project
that began in 1988.

Now, it would seem to be true that CKO didn't become prevalent in
"commodity" NICs until GbE NICs (not sure if it was in some "server"
100BT NICs or not - wasn't in the Tulip-based stuff used on the HP
9000 anyway...), by then either the chicken and egg of NICs with CKO
and "commodity" stacks with CKO was solved, or the major vendors had
managed to convince the NIC vendors to include it as part of switching
to those NICs for their higher-end systems and not just PC's.
Perhaps a bit of both.

Whether a Gigabit Ethernet NIC is considered a highest-end NIC
probably depends in no small part on from where one is looking...

rick jones
--
Process shall set you free from the need for rational thought.
these opinions are mine, all mine; HP might not want them anyway... :)
feel free to post, OR email to raj in cup.hp.com but NOT BOTH...

Stephen Fuld

unread,
Oct 31, 2003, 7:19:52 PM10/31/03
to

"George William Herbert" <gher...@gw.retro.com> wrote in message
news:bns3d7$o29$1...@gw.retro.com...

> Terje Mathisen <terje.m...@hda.hydro.com> wrote:
> >I used to be totally in this camp, but there is one very good reason for
> >having a cpu-to-cpu checksum on any kind of transfer:
> >
> >It will catch those _very_ rare but also _very_ destructive cases where
> >things that "cannot happen" still do, i.e. an Ethernet card that
> >silently drops bytes, while confirming the CRC checksum, but only if at
> >least one transmitter is breaking the Ethernet spec.
>
> The problem with this...

snipped the rest of an excellent presentation.

I agree totally. I think that historically if TCP/IP had been designed
originally to be more "hardware friendly" such that even low cost and easy
hardware did the checking as a matter of course, very few software
implementations would do the checksumming normally (though they might for
debugging purposes) and we wouldn't be having this argument. Look at things
like Fibre Channel or Infiniband, which were designed this way. No one
software checks their checksums. It seems to me that people are now making
a virtue out of a necessity. I don't blame the original TCP/IP architects
for lack of foresight, though it certainly would be better for all of us if
they had more of it. :-)

Rick Jones

unread,
Oct 31, 2003, 7:21:24 PM10/31/03
to
Jonathan Stone <jona...@pescadero.dsg.stanford.edu> wrote:
> Yes, indeed hardwdare can compute checksums. But of course such a
> checksum covers only the path between that dedicated piece of
> hardware, and the similar piece of hardware at the next hop which
> verifies the checksum. It does not cover the path to and from main
> memory buffers, which is what networking people mean by "end-to-end".

I think it might be more accurate to state that it is between that
dedicated hardware (in the NIC) and the "next to last hop" rather than
"next hop." It is still not strictly conforming to a memory-to-memory
interpretation of end-to-end, but it will be from end-station to
end-station.

The "tensions" between the need/desire for a "zero-copy" solutuion and
strict adherence to memory-to-memory end-to-end in networking are of
course, a very firtile ground for discussions :)

Now, if only those pesky hardware guys could get memory keeping-up
with processors maybe we wouldn't have to worry about it anymore... :)

rick jones
--
a wide gulf separates "what if" from "if only"

Nick Maclaren

unread,
Oct 31, 2003, 7:23:18 PM10/31/03
to
In article <Wryob.201371$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,

Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>If you want to use an additional checksum as a software debugging feature,
>that is fine with me. But then I don't want to hear from anyone (I know you
>wouldn't do this, Nick) about the extra overhead of the code for
>subscript/pointer checking. I would be a lot of money that those types of
>errors are far more frequent than the ones you have described above.

No, I wouldn't! My estimate is that they are hundreds of times more
frequent, overall, but the distribution between applications is very
different.

>> Don't ask me exactly what is going wrong, but the only thing that
>> will detect ALL such errors is end-to-end checksumming AND PROPER
>> HANDSHAKING LOGIC. A lot of DETECTED errors were and are treated
>> as normal end of file because of the lack of the latter.
>
>So you do end to end checksumming in software on all disk transfers? I
>thought not. Why is that different?

It isn't :-) The reasons why not are partly what you say above and
partly because the I/O interfaces are very poorly designed for such
robustness. They COULD be designed to enable it, and I certainly
agree that is a good idea. For example:

A sequential file could have a checksum calculated, and stored at
intervals along its length. This would still allow update, but with
the Fortran/IBM BSAM semantics.

A direct-access file could have a checksum calculated per block,
INCLUDING the offset. This could also be done for sequential files.

Note that these techniques would detect block shuffling errors,
which would help vastly for highly parallel, asynchronous filing
systems and/or defragmentation utilities.
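
A rough sketch of the second idea in C (the block size, the simple
multiplicative sum, and the function name are all just illustrative; a
real implementation would use a proper CRC in its place):

#include <stdint.h>
#include <stddef.h>

/* Per-block checksum that mixes the block's offset into the sum, so a
 * block written to -- or read back from -- the wrong position fails the
 * check even though its contents are intact. */
static uint32_t block_checksum(const uint8_t *data, size_t len, uint64_t offset)
{
    uint32_t sum = (uint32_t)(offset ^ (offset >> 32));  /* fold the offset in */
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + data[i];
    return sum;
}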


Regards,
Nick Maclaren.

Stephen Fuld

unread,
Oct 31, 2003, 7:24:16 PM10/31/03
to

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:bnrqt2$4s2$1...@osl016lin.hda.hydro.com...

You have just stated an argument that the software should add a checksum to
all disk transfers. I presume you don't do that. Why not? You can have
the same kinds of errors that we have discussed with a SCSI interface card as
you have had with an Ethernet interface card.

Jonathan Stone

unread,
Oct 31, 2003, 7:27:13 PM10/31/03
to
In article <qiyob.201360$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:


>
>AFAIK, virtually all but the highest end network cards do not check the TCP

>checksum at all. [...]

Point of fact: $40 consumer-grade cards have TCP and IP-header checksum
offload. Do you really think a $40 card counts as "the highest end"?

At this point I should say that there's very little good empirical
data on the actual rates of damaged TCP (or other) packets `in the
wild' on real networks. What little data there is is described in a
couple of papers by myself, my co-advisor and colleagues. The observed
rates of damaged packets were disturbingly high. Detailed analysis of
many of the damaged packets (comparing the damaged data to a later TCP
retransmission) yielded very strong evidence that errors were
occurring between the sending TCP and the sending NIC: errors which
will not be caught by an outboard checksum engine.

Anyone contemplating outboard TCP/IP checksum offload engines should
at least be *aware* of those data.

>Of course, TCP makes things a little harder for the
>hardware by not putting the checksum at the end of the packet. The newer
>high end TCP Offload Engines (TOEs) do check the checksum and perform other
>functions, but they are still relatively rare. And a significant reason why
>they were developed is the problems many systems were having keeping up with
>all the software required to handle TCP when you have multiple 1 Gb Ethernet
>cards or future 10 Gb cards on a server.

There's a very small grain of truth in there, but it's _very_ small.
For contemporary machines, the biggest problem even a low-end consumer
desktop machine has in keeping up with a single gigabit Ethernet
stream is the I/O bus bottleneck (32-bit/33MHz PCI is typical on
desktops). Contemporary i386 desktop machines have a private bus
between an onboard gig-e device and main memory.

Contemporary server-class machines have PCI-X busses and are quite
capable of sinking multiple gigabit streams. Check out the Internet
Land Speed record numbers with the Intel Pro/10GbE cards, P4 Xeons,
Intel se7501 chipsets. If you have to touch the TCP data stream *at
all*, a folded copy/checksum has negligible marginal cost over a data
copy. That's been well-known since, oh, see several SIGCOMM papers in
the 1994-1995 time frame.

IMNSHO, one of the primary drivers for ToE engines is to implement
iSCSI, using low-cost/embedded CPUs, and to aid implementing iSCSI
interfaces which provide the same interface to software as existing
FC/SCSI HBA interfaces. The ill-considered choice of a link-level CRC
as the iSCSI application-level check has something to do with it, too.

(BTW, I agree with you 100% about placing checksums at the end of
packets. I wanted to do that for SCTP, but it was a long uphill
battle and was considered too much of a change).

Jonathan Stone

unread,
Oct 31, 2003, 7:44:32 PM10/31/03
to
In article <bntpdg$tdi$1...@news.rchland.ibm.com>,
Del Cecchi <cec...@us.ibm.com> wrote:

>But it does seem sort of strange that the same networking folks
>worrying about broken hardware somehow corrupting their data blithely take
>the length of the packet on blind faith and get caught by buffer overruns
>time after time after time. Ping of Death and many others. Fool me 17
>times, shame on me.


Scuse me, what's a Ping of Death? Oh, you mean a Linux bug? Many
"networking folks" would take offense with being tarred with that
label. There's good reason why the BSD networking stack was the
de facto standard in TCP/IP networking for nearly two decades.

Your comment is about as ill-founded as flaming "the same hardware people"
-- including you at IBM -- solely on the grounds of the Pentium FPU bug.

Jonathan Stone

unread,
Oct 31, 2003, 7:53:21 PM10/31/03
to
In article <Qxyob.8193$ks7....@news.cpqcorp.net>,
Rick Jones <f...@bar.baz.invalid> wrote:
>Stephen Fuld <s.f...@pleaseremove.att.net> wrote:

>Now, it would seem to be true that CKO didn't become prevalent in
>"commodity" NICs until GbE NICs (not sure if it was in some "server"
>100BT NICs or not - wasn't in the Tulip-based stuff used on the HP
>9000 anyway...),

Rick

The latest Pro/100 family (82557? 82550?) has checksum offload and has
done so for what, 18 months now. I understand the Pro/1000 variants with
checksum offload and IPsec offload can do only one of those at a time
(checksum offload, or IPsec offload, not both).

OTOH these days, GbE *is* commodity: around $40 for a 32-bit PCI GbE
card with TCP/IP checksum offload, jumbo frames, etc.

And I stand by my earlier comment: the hardware vendors demonstrably
just did not get checksum offload right in all the various corner
cases. That's a fact. A rather well-known fact in the arena of people
who write open-source device drivers for the darn things.


>Whether a Gigabit Ethernet NIC is considered a highest-end NIC
>probably depends in no small part on from where one is looking...

Six months ago, maybe, but in the interim, the market has overtaken
you. GbE is commodity. About six months ago a certain high-end
switch vendor dropped its price per port for GbE to the exact same
price point as 10/100 ports. "Low-end" desktop boards can now be
bought for around $40, using the same cores (possibly price-reduced or
22-bit versions) of what were "server" cards selling for $100 just a
year ago.

Jonathan Stone

unread,
Oct 31, 2003, 7:59:47 PM10/31/03
to
In article <bnrqt2$4s2$1...@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>Stephen Fuld wrote:

[...]

>I used to be totally in this camp, but there is one very good reason for
>having a cpu-to-cpu checksum on any kind of transfer:
>
>It will catch those _very_ rare but also _very_ destructive cases where
>things that "cannot happen" still do, i.e. an Ethernet card that
>silently drops bytes, while confirming the CRC checksum, but only if at
>least one transmitter is breaking the Ethernet spec.

Its not so rare. And it doesn't require transmitters to break the
Ethernet spec: it can happen on the receiver side, too.

Every few years the idea of relying on the link-level CRC for local
traffic resurfaces in the networking research community. I've
personally seen people surrounded by well-known researchers wielding
plastic tableware in theatrical dagger poses.


>I've seen this happen with at least two different "server-level" network
>cards, working inside boxes with some form of ECC on all buses.
>
>I.e. all i/o was protected at all levels, but we still got silent data
>corruption about once a day or so, when replicating GB's of data to
>multiple servers.
>

>Today I would have used rsync on top of ssh instead.

Doesn't help if the problem is on the receiving NIC, and the receiving
NIC has a FIFO which occasionally drops a 16-bit word. Or exchanges
two 16-bit words in the interface between a 16-bit chip and a 32-bit
bus (hp300s, hello Rick!). Or the DMA engine in the NIC munges a high-order
bit in a DMA pointer once in a while. I've seen __all__ of these.

The worst story I heard was about a large NFS server in the 1980s
[i.e., in the window where Sun turned off UDP checksumming for NFS],
with a hardware problem in the server's receive-side NIC. Writes from
NFS client to the server disk would get corrupted by the server NIC.
The backups of the affected data were thus also corrupted. The problem
wasn't detected for *months*, by which point people were scurrying to
find hard-copy printouts they could get retyped.

Rick Jones

unread,
Oct 31, 2003, 8:01:52 PM10/31/03
to
Jonathan Stone <jona...@pescadero.dsg.stanford.edu> wrote:
> (BTW, I agree with you 100% about placing checksums at the end of
> packets. I wanted to do that for SCTP, but it was a long uphill
> battle and was considered too much of a change).

How many NICs (gbit and such) these days "stream" data to/from the
host rather than store-and-forward it? Doesn't that make a difference
as to the value of a trailer checksum? At one point there was indeed a
draft RFC for an option to TCP to allow trailer checksums (Subramian
(sp) and Bridges IIRC) but it went nowhere in the IETF.

rick jones
--
oxymoron n, Hummer H2 with California Save Our Coasts and Oceans plates

Nick Maclaren

unread,
Oct 31, 2003, 8:17:56 PM10/31/03
to
In article <bnud2h$dna$1...@Pescadero.DSG.Stanford.EDU>,

Jonathan Stone <jona...@Pescadero.DSG.Stanford.EDU> wrote:
>
>At this point I should say that there's very little good empirical
>data on the actual rates of damaged TCP (or other) packets `in the
>wild' on real networks. What little data there is is described in a
>couple of papers by myself, my co-advisor and colleagues. The observed
>rates of damaged packets were disturbingly high. Detailed analysis of
>many of the damaged packets (comparing the damaged data to a later TCP
>retransmission) yielded very strong evidence that errors were
>occurring between the sending TCP and the sending NIC: errors which
>will not be caught by an outboard checksum engine.

Yes. I have never written up what I have found, largely because the
results were found sporadically over a decade or two, mixed in with
hundreds of other investigations. And I have kept few notes!

My observations are that a significant proportion (perhaps as high as
50%) of 'network errors' were definitely in the software, at a higher
level than simple checksums. I.e. that proportion of the problems
were block corruption of various forms. I have seen that on many
different TCP/IP stacks, including ones with no ancestry in common.

What I can't say is exactly how or where they happened, but I have
had some evidence that at least some were due to faulty error recovery.
That is, of course, precisely where you would expect coding and even
design errors to lurk in software that has been used heavily for
many years.

The few cases of 'simple' data corruption in disk I/O that I have seen
have had similar characteristics.

But, of course, these were only the transmission errors that got
noticed, were in a file copy, and where I checked up both the input
and output files. My gut feeling is that there are a lot more of
them than most people think, and they are put down to a transient
and undetermined cause (often not even transmission) or simply not
noticed.

Without a much improved end-to-end error checking mechanism, it is
very hard to know what error rates REALLY are.


Regards,
Nick Maclaren.

Jonathan Stone

unread,
Oct 31, 2003, 8:18:28 PM10/31/03
to
In article <3fa196e0$1...@news.meer.net>, Greg Lindahl <lin...@pbm.com> wrote:
>In article <bns3d7$o29$1...@gw.retro.com>,
>George William Herbert <gher...@gw.retro.com> wrote:

>
>What do you think the right response to the situation outlined in
>this paper is?

A teaser: I think Tony Hain's answer was a pretty good one. But even
after removing the dominant single source in that paper as an
"artifact", the remamining observed error rates were disturbingly high.

I really must find time to write up the last two chapters of my
dissertation as a paper: turns out that the response taken by iSCSI
and SCTP to that paper was, in retrospect, not such a good one.

Jonathan Stone

unread,
Oct 31, 2003, 8:13:37 PM10/31/03
to
In article <bns3d7$o29$1...@gw.retro.com>,
George William Herbert <gher...@gw.retro.com> wrote:
>Terje Mathisen <terje.m...@hda.hydro.com> wrote:
>The problem with this...
>
>You have to trust something, unless you're running something
>like a Stratus or Tandem box with lockstepped CPUs. Normally
>that is 'I trust the CPU'. In a SMP box, that is 'I trust the CPUs'.

George:

One big difference between disks and networks is that disks are
"slaves": they don't spontaneously generate large volumes of very
bursty traffic -- at least not without being instructed to do so.

One further major difference is that disks generate well-behaved DMA
traffic: integral numbers of disk blocks, usually some power-of-two
multiple of 512 byte blocks.

Network devices are often much more ill-behaved, doing nastier small
scatter-gather DMAs (for example, delivering Ethernet/IP/TCP packets
into memory so that the IP/TCP headers are 32-bit aligned).

Network devices typically have some minimal frame length: higher-level
protocols have to pad shorter packets out to that frame length, which
is another fruitful source of errors in TCP/IP checksum offload
implementations. There's also typically some maximum frame length:
layer-3 packets larger than that frame length must be fragmented
by the sender and reassembled by the receiver. Attempting to do
outboard checksums over fragmented packets, whilst delivering
individual frames to software, but not getting the NIC-checksum-assist
reassembly quite consistent with software reassembly, is another
fruitful source of errors in (UDP) checksum assist.

(ps: anything to do with sci.space?)

Terje Mathisen

unread,
Oct 31, 2003, 8:20:48 PM10/31/03
to
Stephen Fuld wrote:

> "Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message

>>I.e. all i/o was protected at all levels, but we still got silent data
>>corruption about once a day or so, when replicating GB's of data to
>>multiple servers.
>
> You have just stated an argument that the software should add a checksum to

I agree Stephen. Consistency is good! :-)

> all disk transfers. I presume you don't do that. Why not? You can have
> the same kinds of errors that we have discussed with a SCSI interface card as
> you have had with an Ethernet interface card.

The operative word here is 'can', as in "You can have the same kinds of
errors".

The main reason I'm not advocating this for all disk io is simply that I
haven't actually seen this happen, or at least not without the disk hw
detecting the error after the fact, in the form of a broken ECC code.

For critical/sequential data I get exactly this sort of protection by
packing the files into zip/gz/tgz archives, where everything is checked
with CRC32 or similar codes.

Terje

--
- <Terje.M...@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"

Jonathan Stone

unread,
Oct 31, 2003, 8:24:05 PM10/31/03
to
In article <bnsqal$qor$1...@gw.retro.com>,

George William Herbert <gher...@gw.retro.com> wrote:

>I objected specifically to forcing the CPU to do it rather
>than assuming the network interface ASIC can do so properly.

Point #1: The empirical evidence is that the NIC designers have a very
poor track record of getting this completely right. Why do you
persist in assuming that the NIC ASICs get it right when the evidence
is that, historically, they *don't* get it 100% right?

(An analogy to NASA achieving COTS is on the tip of my tongue.)

Point #2: the empirical data shows fairly strongly that even if the
NIC ASIC designers *were* to implement checksum offload 100%
correctly, there are non-trivial sources of errors which occur in the
path between the main memory and the NIC asic checksum engine. The
NIC ASIC cannot protect against those errors: some (but not all)
of those errors are in any case beyond their control.

Greg Lindahl

unread,
Oct 31, 2003, 8:49:10 PM10/31/03
to
In article <AFyob.201388$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

>You have just stated an argument that the software should add a checksum to
>all disk transfers. I presume you don't do that. Why not? You can have
>the same kinds of errors that we have discussed with a SCSI interface card as
>you have had with an Ethernet interface card.

In fact, this is an important thing to think about. Why do high-end
RAID vendors tend to only support certain firmware revs in disks?
Well, the ECC generation and check in a disk system is implemented in
disk firmware...

Were I designing a RAID controller, especially one destined to work
with arbitrary disks, I would definitely add a higher level CRC. And I
would add enough cpu power to the controller to let it run 100% of the
time.

greg

Terje Mathisen

unread,
Oct 31, 2003, 8:48:16 PM10/31/03
to
Jonathan Stone wrote:
> Contemporary server-class machines have PCI-X busses and are quite
> capable of sinking multiple gigabit streams. Check out the Internet
> Land Speed record numbers with the Intel Pro/10GbE cards, P4 Xeons,
> Intel se7501 chipsets. If you have to touch the TCP data stream *at
> all*, a folded copy/checksum has negligible marginal cost over a data
> copy. That's been well-known since, oh, see several SIGCOMM papers in
> the 1994-1995 time frame.

I wrote/posted one of the first versions of this several years earlier,
you can find it mentioned in one of Mike Abrash's x86 optimization books.

I figured out that the inner loop can be written _very_ compactly on x86
machines, at very close to maximum speed (this is from >10 year old memory):

next4:
mov [edi],edx ; Store previous word
mov edx,[edi+esi] ; Load next source word

lea edi,[edi+4] ; Update source & dest pointer
adc eax,edx ; Accumulate & wrap previous carry

dec ecx ; More words to do? (CARRY NOT MODIFIED!)
jnz next4 ; Yes, go on!

I believe this is one of the more elegant pieces of code I have ever
written, it just fits perfectly within the x86 instruction set:

On a 486 this took one cycle/instruction, except for the JNZ which was a
little slower afair.

On a Pentium the exact same code paired perfectly, resulting in 3 cycles
per 4 bytes.

On a 3-issue machine which can handle load/store operations two cycles
after the address has been updated, the loop can run in just two cycles
per iteration.

The best method today is probably to use even wider registers (64 or
128-bit wide), with explicit accumulation of any carries, using packed
compares and a separate carry-accumulate register.

When sending data I'd use cache-bypassing store operations to avoid
L1/L2 pollution.

Today the real cost of receiving some data isn't the need to do the
checksum, but the fact that all (hopefully!) of it has to be faulted
into the L1 cache before the application can use it.

Whether the overhead of doing this is part of the network driver or the
application doesn't really matter.
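
For reference, the same folded copy-and-checksum idea in plain C (a
sketch only: it assumes an even byte count and ignores the alignment and
byte-order details a real implementation has to get right):

#include <stdint.h>
#include <stddef.h>

/* Copy len bytes from src to dst and return the 16-bit ones'-complement
 * (Internet) checksum of the data, folding carries as in RFC 1071.
 * The copy and the checksum share a single pass over the data, which is
 * the whole point: the marginal cost of the sum is close to zero. */
static uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len / 2; i++) {
        uint16_t w = src[i];
        dst[i] = w;
        sum += w;
    }
    while (sum >> 16)                     /* end-around carry */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}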

Thomas Womack

unread,
Oct 31, 2003, 8:53:05 PM10/31/03
to
In article <bnuejh$dsm$1...@Pescadero.DSG.Stanford.EDU>,

Jonathan Stone <jona...@Pescadero.DSG.Stanford.EDU> wrote:
>In article <Qxyob.8193$ks7....@news.cpqcorp.net>,
>Rick Jones <f...@bar.baz.invalid> wrote:
>>Stephen Fuld <s.f...@pleaseremove.att.net> wrote:
>
>>Now, it would seem to be true that CKO didn't become prevalent in
>>"commodity" NICs until GbE NICs (not sure if it was in some "server"
>>100BT NICs or not - wasn't in the Tulip-based stuff used on the HP
>>9000 anyway...),
>
>Rick
>
>The latest Pro/100 family (82557? 82550?) has checksum offload and has
>dome for what, 18 months now. I understand the Pro/1000 variants with
>checksum offload and IPsec offload can do only one of those at a time
>(checksum offload, or IPsec offload, not both).
>
>OTOH these days, GbE *is* commodity: around $40 for a 32-bit PCI GbE
>card with TCP/IP checksum offload, jumbo frames, usw.

Migods! I knew that the interface was commodity -- it's been on
medium-end and above Apples for ages, possibly because whole-laptop
backup is one of the obvious normal-user applications for gigabit; it
started appearing as a standard in-built feature on the Intel P4
laptops in about April. But I'd always thought the switches were too
expensive; so I'm amazed to discover that my normal supplier will sell
me a five-port gigabit switch for 70 pounds, delivering tomorrow.

I don't know whether the same will happen with 10GbE. Fibre doesn't
seem very civilised stuff to have random-users working with, and I
have trouble believing in 10Gb over copper. The 10Gb fibre interface
cards are four thousand pounds but reasonably widely available
(http://att.dealtime.com/xPO-Intel_PRO_10GbE_LR_Server_Adapter_PXLA8590LR);
I don't have any idea even where I'd look for a 10Gb switch. It's
much too expensive to use as a to-every-node cluster interconnect; I
wonder if there are applications which would work nicely with a
within-rack gigabit ethernet and switched 10Gb between racks.

Tom

Rick Jones

unread,
Oct 31, 2003, 8:56:16 PM10/31/03
to
Jonathan Stone <jona...@pescadero.dsg.stanford.edu> wrote:
> Or exchanges two 16-bit words in the interface between a 16-bit chip
> and a 32-bit bus (hp300s, hello Rick!).

Sure - and I can personally attest to a time in the late 1980's
(perhaps '89) where internally, we had some data corruption stemming
from a busted NIC (old CIO interface on a classic 3000), and where the
NS Transport didn't catch it because it was willing to make the
checksum optional for TCP for local traffic :)

That still doesn't mean I'm uncomfortable enough with CKO to say we should
toss it - I have at least as much faith in a CKO NIC as I do in, say,
a RAID controller... :)

rick jones
--
a wide gulf separates "what if" from "if only"

Nick Maclaren

unread,
Oct 31, 2003, 9:03:36 PM10/31/03
to
In article <rbw*Sl...@news.chiark.greenend.org.uk>,

Thomas Womack <two...@chiark.greenend.org.uk> wrote:
>>
>>OTOH these days, GbE *is* commodity: around $40 for a 32-bit PCI GbE
>>card with TCP/IP checksum offload, jumbo frames, usw.
>
>Migods! I knew that the interface was commodity -- it's been on
>medium-end and above Apples for ages, possibly because whole-laptop
>backup is one of the obvious normal-user applications for gigabit; it
>starting appearing as a standard in-built feature on the Intel P4
>laptops in about April. But I'd always thought the switches were too
>expensive; so I'm amazed to discover that my normal supplier will sell
>me a five-port gigabit switch for 70 pounds, delivering tomorrow.

Watch out. With both Fast and Gigabit Ethernet, the first cards and
switches were dire - typically little faster than the best of the
previous generation. It isn't impossible that a really cheap switch
is one of the ones that manages to block badly even with 5 ports!

>I don't know whether the same will happen with 10GbE. Fibre doesn't
>seem very civilised stuff to have random-users working with, and I
>have trouble believing in 10Gb over copper. The 10Gb fibre interface
>cards are four thousand pounds but reasonably widely available
>(http://att.dealtime.com/xPO-Intel_PRO_10GbE_LR_Server_Adapter_PXLA8590LR);

Fibre is a LOT more civilised than copper. Remember that the Gigabit
fibre was tested in HiPPI for quite a while before it hit Ethernet.
We are actually using HiPPI cables for Gigabit Ethernet :-)

I have heard that 10Gb Ethernet uses the same fibre, but don't know
how it performs, and haven't yet seen many real performance measurements
on 10Gb Ethernet. Some of the cards might well deliver only 2 Gb/sec.

>I don't have any idea even where I'd look for a 10Gb switch. It's
>much too expensive to use as a to-every-node cluster interconnect; I
>wonder if there are applications which would work nicely with a
>within-rack gigabit ethernet and switched 10Gb between racks.

There are, except that the latency is rather high, and very high
indeed if you use TCP/IP. That doesn't always matter, but sometimes
does.


Regards,
Nick Maclaren.

Zak

unread,
Oct 31, 2003, 9:04:15 PM10/31/03
to
Rick Jones wrote:

> Now, it would seem to be true that CKO didn't become prevalent in
> "commodity" NICs until GbE NICs (not sure if it was in some "server"
> 100BT NICs or not - wasn't in the Tulip-based stuff used on the HP
> 9000 anyway...), by then either the chicken and egg of NICs with CKO
> and "commodity" stacks with CKO was solved, or the major vendors had
> managed to convince the NIC vendors to include it as part of switching
> to those NICs for their higher-end systems and not just PC's.
> Perahaps a bit of both.

And these days, even some versions of the lowly 100 Mb/s Realtek 8139
seem to offer checksum and segmentation offload.

http://www.realtek.com.tw/products/products1-2.aspx?modelid=7

Thomas

Jonathan Stone

unread,
Oct 31, 2003, 9:06:25 PM10/31/03
to
In article <sByob.22691$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

>snipped the rest of an excellent presentation.
>
>I agree totally. I think that historically if TCP/IP had been designed
>originally to be more "hardware friendly" such that even low cost and easy
>hardware did the checking as a matter of course, very few software
>implementations would do the checksumming normally (though they might for
>debugging purposes) and we wouldn't be having this argument.

If you have a *genuine* interest in this topic, you really should take
the time to read TCP/IP design documents like IEN-45 (e.g.,
ftp.isi.edu:in-notes/ien/ien45.txt), and see what the IP/TCP designers
*actually* thought about checksum offload.

Or you could ask someone who remembers having their files trashed in
the 1980s as a consequence of NFS relying on an outboard checksum, and
being hit by errors outside the coverage of that outboard checksum.

Or you can ignore the historical record, and go on making up stuff to
suit your own agenda. But denigrating the entire networking field
based on such made-up stuff isn't going to win you many points.

> Look at things
>like Fibre Channel or Infiniband, which were designed this way.

And look carefully at their failure semantics (especially Infiniband),
and look at who is deploying either one as general-purpose transport
protocols. (Even ATM did better than that.)

Del Cecchi

unread,
Oct 31, 2003, 9:14:08 PM10/31/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnue30$dra$1...@Pescadero.DSG.Stanford.EDU...

Apologies for any offense. It was not intended. As a hardware person, I'm
not trying to start a war here. Clearly there is fault on both sides when
systems are unreliable.

I don't think the "ping of death" was strictly a Linux thing. It affected,
as I recall, several unices perhaps including AIX. I don't remember. It
was the bug where a ping was sent with 64KB of data attached which caused
the receiving system to crash. Here is what popped up at the top of google.

"On the Internet, ping of death is a denial of service (DoS) attack caused
by an attacker deliberately sending an IP packet larger than the 65,536
bytes allowed by the IP protocol. One of the features of TCP/IP is
fragmentation; it allows a single IP packet to be broken down into smaller
segments. In 1996, attackers began to take advantage of that feature when
they found that a packet broken down into fragments could add up to more
than the allowed 65,536 bytes. Many operating systems didn't know what to do
when they received an oversized packet, so they froze, crashed, or rebooted"
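
The arithmetic behind that is easy to sketch (a toy C illustration; the
constants come straight from the IPv4 header format, where the fragment
offset field is 13 bits wide and counts 8-byte units):

#include <stdio.h>

int main(void)
{
    unsigned max_offset_units = (1u << 13) - 1;        /* 8191 fragment-offset units */
    unsigned max_offset_bytes = max_offset_units * 8;  /* 65528: where a last fragment may start */
    unsigned last_fragment_payload = 100;              /* anything over 7 bytes overflows */

    unsigned reassembled = max_offset_bytes + last_fragment_payload;
    printf("reassembled datagram = %u bytes, limit = 65535\n", reassembled);
    /* A receiver that sizes its reassembly buffer from the 16-bit total
     * length field gets more data than it allocated for. */
    return 0;
}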

So it wasn't a Linux vulnerability, but related to TCP/IP and related
software in various OS. Why did you single out Linux?

There have been and still seem to be a number of exploits related to
receiving more data than was expected, in which the extra data is written
into areas in which it isn't supposed to be. This is the "buffer over run"
trick, right? And these packets normally are associated with networks,
hence my generic reference to "networking folks". Again, no intention to
start a flame war here.

Now you can explain why so much hardware/software associated with networking
seems to be vulnerable to damage by unexpected data characteristics.

As I recall the FP bug, the only thing notable about it was Intel trying to
say it didn't matter. Pretty much all hardware has bugs. As does software.
The real question is what one does about it in the future.

del cecchi


Nick Maclaren

unread,
Oct 31, 2003, 9:59:43 PM10/31/03
to
In article <bnujb1$sm4$1...@news.rchland.ibm.com>,

Del Cecchi <cec...@us.ibm.com> wrote:
>
>I don't think the "ping of death" was strictly a Linux thing. It affected,
>as I recall, several unices perhaps including AIX. I don't remember. It
>was the bug where a ping was sent with 64KB of data attached which caused
>the receiving system to crash. Here is what popped up at the top of google.

It wasn't. And that was not the only way in which pings could cause
denials of service. Few, if any, TCP/IP stacks were wholly immune
from bugs of that nature.

>As I recall the FP bug, the only thing notable about it was Intel trying to
>say it didn't matter. Pretty much all hardware has bugs. As does software.
>The real question is what one does about it in the future.

Not quite. That was the SECOND issue. The FIRST was that Intel had
previously reduced their time to market by removing one empirical,
post-production validation stage on the grounds that the hardware
design tools were adequate. Well, they weren't ....

The point there is that running the validation in PARALLEL with initial
shipping would have caught the error before the first chips had hit
real customers, and so Intel would merely have had to recall chips
from OEMs and the odd mad hacker.


Regards,
Nick Maclaren.

Philip Armstrong

unread,
Oct 31, 2003, 9:23:42 PM10/31/03
to
In article <bnue30$dra$1...@Pescadero.DSG.Stanford.EDU>,

Jonathan Stone <jona...@Pescadero.DSG.Stanford.EDU> wrote:
>In article <bntpdg$tdi$1...@news.rchland.ibm.com>,
>Del Cecchi <cec...@us.ibm.com> wrote:
>>But it does seem sort of strange that the same networking folks
>>worrying about broken hardware somehow corrupting their data blithely take
>>the length of the packet on blind faith and get caught by buffer overruns
>>time after time after time. Ping of Death and many others. Fool me 17
>>times, shame on me.
>Scuse me, what's a Ping of Death? Oh, you mean a Linux bug? Many
>"networking folks" would take offense with being tarred with that
>label. There's good reason why the BSD networking stack was the
>defacto standard in TCP/IP networking or nearly two decades.

If you're going to bring pointless OS advocacy into an otherwise
civilised discussion, at least make it *accurate* pointless OS
advocacy.

The 'Ping of Death' bug was in the BSD networking stack in the first
place. Which explains why it affected many (most?) of the operating
systems of the time. They'd all taken the BSD stack as their starting
point for their own networking code.

Phil

--
http://www.kantaka.co.uk/ .oOo. public key: http://www.kantaka.co.uk/gpg.txt

Stephen Fuld

unread,
Oct 31, 2003, 10:16:01 PM10/31/03
to

"Nick Maclaren" <nm...@cus.cam.ac.uk> wrote in message
news:bnucr6$do9$1...@pegasus.csx.cam.ac.uk...

> In article <Wryob.201371$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

snip

> >So you do end to end checksumming in software on all disk transfers? I
> >thought not. Why is that different?
>
> It isn't :-) The reasons why not are partly what you say above and
> partly because the I/O interfaces are very poorly designed for such
> robustness.

Why are they more poorly designed for such robustness than a network
interface? You pointed out several ways that it could be done. These could
be added to the I/O handler without too much trouble (at least conceptually,
but certainly no more difficult practically than doing a network protocol
stack). Or any application is free to add whatever checksums it likes (even
including offsets) to its data. I have known applications (e.g. one sort
package I know of for its intermediate files) that did this. I think the
point isn't that it is hard, but that people just don't think it is worth
the overhead in general.

Stephen Fuld

unread,
Oct 31, 2003, 10:26:10 PM10/31/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnud2h$dna$1...@Pescadero.DSG.Stanford.EDU...

> In article <qiyob.201360$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>
> >
> >AFAIK, virtually all but the highest end network cards do not check the TCP
> >checksum at all. [...]
>
> Point of fact: $40 consumer-grade cards have TCP and IP-header checksum
> offload. Do you really think a $40 card counts as "the highest end"?

No. But my data on pricing of such cards was somewhat out of date (not all
that long ago, they were in the hundreds of dollars!). In any event, those
features appearing on commodity-level cards is a pretty recent development.

> At this point I should say that there's very little good empirical
> data on the actual rates of damaged TCP (or other) packets `in the
> wild' on real networks. What little data there is is described in a
> couple of papers by myself, my co-advisor and colleagues. The observed
> rates of damaged packets were disturbingly high. Detailed analysis of
> many of the damaged packets (comparing the damaged data to a later TCP
> retransmission) yielded very strong evidence that errors were
> occurring between the sending TCP and the sending NIC: errors which
> will not be caught by an outboard checksum engine.
>
> Anyone contemplating outboard TCP/IP checksum offload engines should
> at least be *aware* of those data.

Absolutely. So I presume those errors occurred either in the interface from
the NIC to the host memory or in the host software. If the first, it is
poorly designed hardware and should be fixed. Even given PCI and its
vagaries, there is little excuse for that sort of sloppiness. The second
should also be fixed. As I said previously, I have absolutely no objection
to checking checksums in software as a debugging tool. But let me point out
again that the same two sources of error would occur on a disk transfer
over a SCSI interface card, but no one talks about software checksumming of
disk I/O data.

snip

> IMNSHO, one of the primary drivers for ToE engines is to implement
> iSCSI, using low-cost/embedded CPUs, and to aid implementing iSCSI
> interfaces which provide the same interface to software as existing
> FC/SCSI HBA interfaces. The ill-considered choice of a link-level CRC
> as the iSCSI application-level check has something to do with it, too.

Don't get me started on iSCSI. :-( I agree that offloading it is also a
big motivation for TOEs.

Stephen Fuld

unread,
Oct 31, 2003, 10:36:18 PM10/31/03
to

"Terje Mathisen" <terje.m...@hda.hydro.com> wrote in message
news:bnuhqh$p0k$1...@osl016lin.hda.hydro.com...

> Jonathan Stone wrote:
> > Contemporary server-class machines have PCI-X busses and are quite
> > capable of sinking multiple gigabit streams. Check out the Internet
> > Land Speed record numbers with the Intel Pro/10GbE cards, P4 Xeons,
> > Intel se7501 chipsets. If you have to touch the TCP data stream *at
> > all*, a folded copy/checksum has negligible marginal cost over a data
> > copy. That's been well-known since, oh, see several SIGCOMM papers in
> > the 1994-1995 time frame.
>
> I wrote/posted one of the first versions of this several years earlier,
> you can find it mentioned in one of Mike Abrash' x86 optimization books.
>
> I figured out that the inner loop can be written _very_ compactly on x86
> machines, at very close to maximum speed (this is from >10 year old memory):

Snipped another of Terje's examples of truely elegant code.

Yes, but that sort of misses the point. The fact that you can "hide" the
overhead of the checksum in other overhead, while a tribute to your
abilities, ignores the fact that the additional overhead shouldn't be there
at all.

Stephen Fuld

unread,
Oct 31, 2003, 10:36:19 PM10/31/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnugd5$e3m$1...@Pescadero.DSG.Stanford.EDU...

> In article <bnsqal$qor$1...@gw.retro.com>,
> George William Herbert <gher...@gw.retro.com> wrote:
>
> >I objected specifically to forcing the CPU to do it rather
> >than assuming the network interface ASIC can do so properly.
>
> Point #1: The empirical evidence is that the NIC designers have a very
> poor track record of getting this completely right. Why do you
> persist in assuming that the NIC ASICs get it right when the evidence
> is that, historically, they *don't* get it 100% right?

I think there is somewhat of a feedback loop going on here. The NIC guys
didn't put the effort into getting the NIC to host transfer right because
they knew the software guys would catch their problems (and probably
attribute them to network errors to boot!). Whereas the people who
developed SCSI adaptors knew that there would be no host checksum and that
they would get blamed, so spent the effort to get it right.

> (An analogy to NASA achieving COTS is on the tip of my tongue.)
>
> Point #2: the empirical data shows fairly strongly that even if the
> NIC ASIC designers *were* to implement checksum offload 100%
> correctly, there are non-trivial sources of errors which occur in the
> path between the main memory and the NIC asic checksum engine. The
> NIC ASIC cannot protect against those errors: some (but not all)
> of those errors are in any case beyond their control.

OK, but again, why don't we do software checksums on SCSI disk I/O?

Jonathan Stone

unread,
Oct 31, 2003, 10:32:07 PM10/31/03
to
In article <bnujb1$sm4$1...@news.rchland.ibm.com>,
Del Cecchi <cec...@us.ibm.com> wrote:

>> Your comment is about as ill-founded as flaming "the same hardware people"
>> -- including you at IBM -- solely on the grounds of the Pentium FPU bug.
>
>Apologies for any offense. It was not attended. As a hardware person, I'm
>not trying to start a war here. Clearly there is fault on both sides when
>systems are unreliable.

Thank you for the gracious apology; gratefully accepted.


>I don't think the "ping of death" was strictly a Linux thing. It affected,

>as I recall, several unices perhaps including AIX. [..]

Perhaps I am misremembering (though I sincerely doubt it). Or perhaps
it depends exactly on what is meant by "ping of death". There were
large-ICMP attacks to which Linux was vulnerable, but which had been
fixed in BSD in the period when comp.bugs.4bsd.ucb-fixes was still
active. I can't speak directly to AIX networking code, as I have
little experience with it. How are AIX specfs numbers these days?

>So it wasn't a Linux vulnerability, but related to TCP/IP and related
>software in various OS. Why did you single out Linux?

I made an (incorrect) assumption that the ping-of-death was a
reference to Linux. Linux was the only OS with which I have direct
experience that was vulnerable to this (and other) attacks, as the
BSD networking stack plugged many such cases before Linux even existed.


>There have been and still seem to be a number of exploits related to
>receiving more data than was expected, in which the extra data is written
>into areas in which it isn't supposed to be. This is the "buffer over run"
>trick, right?

Yes. Right so far.

>And these packets normally are associated with networks,

Nope. Here we have a difference: maybe terminological, maybe deeper.
In my book, buffer overrun attacks in applications have nothing to do
with _networking_ per se. To give a concrete example: it is totally
unreasonable to blame the TCP/IP research community (c.f. "same
networking people") for buffer overrun vulnerabilities in sendmail.

>hence my generic reference to "networking folks". Again, no intention to
>start a flame war here.

>Now you can explain why so much hardware/software associated with networking
>seems to be vulnerable to damage by unexpected data characteristics.

Del, with all due respect, I set myself a rather lower bar here. I
*will* step up to the bar of reporting empirically-observed *facts*
about observed data corruption in IP/TCP networks. (I can explain
some, but reverse-engineering explanations for ASIC bugs by staring
at networking traces gets tiresome very quickly). It's hard to
summarize a long dissertation into a Usenet post, but here goes:

I can and will stand up to the bar of reporting *facts* about TCP
segments captured from Ethernets, from Ethernet frames with valid
Ethernet CRCs, but where the TCP payload fails the TCP checksum.

By assumption, if the Ethernet CRC is valid and reached the
packet-capture point, then all Ethernet CRCs along the path were valid,
so such damage must have occurred either inside the end-host, between the
sending TCP buffer and the NIC; or inside switches or firewalls. From
comparing the damaged packets to subsequent TCP retransmissions, there
is very, very strong evidence that both kinds of errors occur.

For further details, consult the graphs and tables in my dissertation
(which has much new material not in the SIGCOMM 2000 paper already cited).


>As I recall the FP bug, the only thing notable about it was Intel trying to
>say it didn't matter. Pretty much all hardware has bugs. As does software.
>The real question is what one does about it in the future.

For the FP bug: sure. I mentioned it solely to illustrate exactly how
unfair and unwarranted the (apologized-for) comparison seemed to
myself, as someone with a networking background.

In the context of data networking: Do a truly end-to-end software
checksum (as suggested in "End-to-end arguments in system design", J
H Saltzer, D P Reed, D D Clark, ACM Transactions on Computer Systems
2(4), 1984). (But note that for streaming or realtime applications,
such an end-to-end check may be too late.)

For networking, the real issue here is: how much does an intermediate
transport-level checksum, like the TCP/IP checksum, buy you? And how
much of that gain is compromised by doing the TCP/IP checksum
outboard, in NIC hardware? The available evidence says, respectively:
enough to be worthwhile, and a fair proportion of the errors which
survive link-level CRC checks.

Stephen Fuld

unread,
Oct 31, 2003, 10:46:30 PM10/31/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnuish$eb6$1...@Pescadero.DSG.Stanford.EDU...

> In article <sByob.22691$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
> >snipped the rest of an excellent presentation.
> >
> >I agree totally. I think that historically if TCP/IP had been designed
> >originally to be more "hardware friendly" such that even low cost and easy
> >hardware did the checking as a matter of course, very few software
> >implementations would do the checksumming normally (though they might for
> >debugging purposes) and we wouldn't be having this argument.
>
> If you have a *genuine* interest in this topic, you really should take
> the time to read TCP/IP design documents like IEN-45 (e.g.,
> ftp.isi.edu:in-notes/ien/ien45.txt), and see what the IP/TCP designers
> *actually* thought about checksum offload.

I tried to download this but got an invalid path. Is there a typo or what
am I missing?

> Or you could ask someone who remembers having their files trashed in
> the 1980s as a consequence of NFS relying on an outboard checksum, and
> being hit by errors outside the coverage of that outboard checksum.

I'm not sure of your point here. If you are saying that NFS had some
problems, I won't disagree with that.

> Or you can ignore the historical record, and go on making up stuff to
> suit your own agenda. But denigrating the entire networking field
> based on such made-up stuff isn't going to win you many points.

I tried to make it clear in an earlier post (and will repeat here) that I do
not blame the TCP/IP designers given where we were when it was designed.

> > Look at things
> >like Fibre Channel or Infiniband, which were designed this way.
>
> And look carefully at their failure semantics (especially Inifinband),
> and look at who is deploying either one as general-purpose transport
> protocols. (Even ATM did better than that.)

Fibre channel is pretty successful in its area though it is certainly not
used for general networking. The issues with IB's commercial success or
lack thereof have been discussed here before several times. Let me just say
that IMNSHO whatever flaws its failure semantics may have are not one of the
principal reasons for its current status.

Jonathan Stone

unread,
Oct 31, 2003, 10:41:15 PM10/31/03
to
In article <uf0b71-...@trigger.kantaka.co.uk>,
Philip Armstrong <ph...@kantaka.co.uk> wrote:


>If you're going to bring pointless OS advocacy into an otherwise
>civilised discussion, at least make it *accurate* pointless OS
>advocacy.


First time I heard of a "ping of death", I tried it against an
up-to-date 4.4BSD-derived stack (NetBSD). As best I recall,
I had no problem.

I can't speak to vendor TCP/IP stacks in that period as I didn't have
privilege to inspect or fix them. As one datapoint: Ultrix 4.3A and
Ultrix 4.4, just a year or two before the "ping of death", shipped
with 4.2BSD UDP/TCP bugs: bugs which had been fixed by CSRG back in 1986.

Jonathan Stone

unread,
Oct 31, 2003, 10:49:13 PM10/31/03
to


>Migods! I knew that the interface was commodity [....]

[Much snipped]

> [...] ; so I'm amazed to discover that my normal supplier will sell


>me a five-port gigabit switch for 70 pounds, delivering tomorrow.
>
>I don't know whether the same will happen with 10GbE.

Almost certainly it eventually will, but perhaps not as quickly as with
gigabit ethernet over copper. The big price hurdle in current 10GbE
is the metropolitan-LAN monomode optics. Expect roughly a factor of 2
price drop in 10GbE, once NICs with 10GbE-CX4 (4x uprated Infiniband
links) starts shipping, followed by shallow price drops after the
market moves from "early adopter" to volume.

(Single-chip five-port 10GbE switches with CX4 were announced Q3 2003,
but I understand they're barely sampling, not in volume till next year).

10GbE over RJ-45 is still years off. If one looks hard at desktop
(commodity) memory bandwidth curves, desktops should be able to handle
a reasonable fraction of 10Gbit/sec TCP/IP by then, in software.

Stephen Fuld

unread,
Oct 31, 2003, 10:56:57 PM10/31/03
to

"Greg Lindahl" <lin...@pbm.com> wrote in message
news:3fa2cab6$1...@news.meer.net...

> In article <AFyob.201388$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
> >You have just stated an argument that the software should add a checksum to
> >all disk transfers.  I presume you don't do that.  Why not?  You can have
> >the same kinds of errors that were discussed with a SCSI interface card as
> >you have had with an Ethernet interface card.
>
> In fact, this is an important thing to think about. Why do high-end
> RAID vendors tend to only support certain firmware revs in disks?

Because they want to know exactly how the disk will respond in various
conditions and perhaps because they have unique requirements that the disk
vendor has implemented for them and are only in certain firmware levels.

> Well, the ECC generation and check in a disk system is implemented in
> disk firmware...

The ECC in a disk drive is implemented in dedicated hardware in the data
path. There is some firmware control to set it up, etc. and to properly
handle the hardware indicated errors.

> Were I designing a RAID controller, especially one destined to work
> with arbitrary disks, I would definitely add a higher level CRC.

Well, I was such a designer (except that we didn't allow arbitrary disks -
for the reasons I stated above) and we did add a higher level CRC. But
unfortunately, because we didn't control the host to controller interface,
it only applied within our controller (i.e. we stripped it off before
sending the data to the host). Most high end controllers do that. Most
lower level ones don't.

> And I
> would add enough cpu power to the controller to let it run 100% of the
> time.

Not CPU power, dedicated hardware. CRC calculation is one of those things
that dedicated hardware can do much more simply than a CPU. In a disk
controller, requiring the CPU to touch the data is a surefire way to kill
performance. You want the controller's CPU to set up the transfers, then get
out of the way till they are done.
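
To illustrate the point with a sketch (mine, not any particular controller's
implementation): the bit-serial form of a CRC is nothing more than a shift
register with a few XOR taps, clocked once per bit as the data streams past.
Written out in C, using the reflected CRC-32 polynomial that Ethernet happens
to use, it looks roughly like this:

    #include <stdint.h>
    #include <stddef.h>

    /* Bit-serial CRC-32 (reflected polynomial 0xEDB88320).  In hardware this
     * is a 32-bit shift register with XOR taps, clocked once per data bit as
     * the data streams past -- no CPU touches the data at all. */
    uint32_t crc32_serial(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;

        for (size_t i = 0; i < len; i++) {
            crc ^= buf[i];
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

Done in software that is eight shift/XOR steps per byte (or a table lookup per
byte in the table-driven form); done in hardware it is a handful of gates
sitting in the data path, with the controller's CPU nowhere in the loop.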

Stephen Fuld

unread,
Oct 31, 2003, 11:07:07 PM10/31/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnubaf$dg0$1...@Pescadero.DSG.Stanford.EDU...
> In article <NSbob.199957$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
> >
> >The idea that a processor should be used to calculate a checksum is one born
> >out of the peculiar blindness of some people with a network background who
> >can't get beyond thinking that TCP/IP is "the" protocol.  In most cases, a
> >reasonably designed protocol will allow the checksum to be
> >calculated/checked in a modest piece of dedicated hardware as the data flows
> >by on its way into or out of the chip.
>
> Stephen, I'm genuinely sorry if I sound offensive here, but your
> message is, at the very least, woefully under-informed about the state
> of the art in checksum offload. Your comments also paint a rather
> broad brush with what seems rather like contempt for networking
> people.

I will repeat again that I have no contempt for the designers of TCP/IP and
many (though certainly not all) network people are very good at what they
do. As for my ignorance of the current state of the art in checksum
offload, I plead guilty. The fundamental problem seems to me not to be in
the detail of where the checksumming is done, but in some of the fundamental
design issues of the underlying protocol. Once again, I don't blame the
original designers for that. They were doing good work given what was
available to them at the time. And the newer people have done a great job
of extending what was there in ways unthought of at the time.

> Yes, indeed hardware can compute checksums.  But of course such a
> checksum covers only the path between that dedicated piece of
> hardware, and the similar piece of hardware at the next hop which
> verifies the checksum. It does not cover the path to and from main
> memory buffers, which is what networking people mean by "end-to-end".

> Data corruption errors can and do occur on that path. Outboard
> checksum hardware won't find them (modulo double-checksum/double-DMA
> tricks like IEN-45).

I understand what end-to-end means.  But if the error occurs in, say, the
interface between the NIC and the host memory, then why don't other cards
with the same type of host interface (i.e. SCSI cards) have the same types
of problems to the same degree?  And if they do, why isn't there a big
clamor for software end-to-end checksumming of disk I/O data?

> Nevermind the well-known demonstrated fact that hardware weenies
> simply cannot get all the corner cases of TCP checksum offload right,
> at least not in the mass-market cards available on today's market.

Given that they can get the hardware checksumming right on other protocols,
does that say more about the hardware designers or the protocol? BTW, I am
not a hardware guy but your calling them "weenies" "seems rather like
contempt" for them :-)

Andy Isaacson

unread,
Oct 31, 2003, 11:07:32 PM10/31/03
to
Just to get this (almost) back on topic:

In article <aDBob.22900$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,


Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>Fibre channel is pretty successful in its area though it is certainly not
>used for general networking.

Actually, there is some application of IP over FC. For example, it's
used in the Cray X1; the mainframe has SPC links (X1 special) out to
PCI-X drawers, which are filled with FC interface cards. Some of the FC
links are used to hit local disk, or for SAN; other FC links are used
to run IP over FC to talk to a router which talks GigE or whatever,
to the customer's network.

http://www.cray.com/craydoc/20/manuals/S-2346-23/html-S-2346-23/z1018025656.html
shows a block diagram, but doesn't explicitly say "IP over FC".

http://www.cray.com/craydoc/20/manuals/S-2348-23/html-S-2348-23/z1023137644fxb.html
has a more complex block diagram.

http://www.cray.com/craydoc/20/manuals/S-2348-23/html-S-2348-23/c1215.html
notes that the CNS performs routing duties.

-andy

Jonathan Stone

unread,
Oct 31, 2003, 11:08:23 PM10/31/03
to
In article <Qczob.8199$Uj7....@news.cpqcorp.net>,
Rick Jones <f...@bar.baz.invalid> wrote:
>Jonathan Stone <jona...@pescadero.dsg.stanford.edu> wrote:
>> (BTW, I agree with you 100% about placing checksums at the end of
>> packets. I wanted to do that for SCTP, but it was a long uphill
>> battle and was considered too much of a change).
>
>How many NICs (gbit and such) these days "stream" data to/from the
>host rather than store-and-forward it? Doesn't that make a difference
>as to the value of a trailer checksum?

Hi Rick,

Wasn't iSCSI mentioned in the paragraph before your quote?  If I
didn't mention it, you have my immediate and abject apologies.

I expect iSCSI will be given disk-transfer-sized chunks. If you go the
typical vendor route and do the iSCSI layer in an offboard CPU/asic,
and present an HBA-like interface to the host, then the outboard CPU
can't [easily] begin sending the iSCSI PDU header, with the iSCSI
[not-quite] end-to-end CRC until after it has read and digested and
CRC'ed all the subsequent data.

The headaches this causes (in comparison to a trailer checksum) for
anyone doing iSCSI-to-FC/SCSI bridges is left as an exercise for the
reader.

>At one point there was indeed a
>draft RFC for an option to TCP to allow trailer checksums (Subramian
>(sp) and Bridges IIRC) but it went nowhere in the IETF.


Dare I mutter "BSD trailer encapsulations" in public? Best not, not
even amongst friends.

Jonathan Stone

unread,
Oct 31, 2003, 11:01:07 PM10/31/03
to
In article <bnuin8$i23$1...@pegasus.csx.cam.ac.uk>,
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:
>In article <rbw*Sl...@news.chiark.greenend.org.uk>,

[...]


>Watch out. With both Fast and Gigabit Ethernet, the first cards and
>switches were dire - typically little faster than the best of the
>previous generation.

Are you *sure*? If memory serves, the first Fast Ethernet NIC I could
buy was the 3c595: which suffered from inheriting a PIO model from the
(AT-era) 3c509. The first gigabit NIC was an Alteon Tigon, which was
good for 20 Mbyte/sec TCP throughput, out of the box (limited by the
200MHz P54C/EDO, which was still faster than the early Pentium-Pro
chipsets like the 440GX).

>Fibre is a LOT more civilised than copper. Remember that the Gigabit
>fibre was tested in HiPPI for quite a while before it hit Ethernet.
>We are actually using HiPPI cables for Gigabit Ethernet :-)

And fibre channel.

[...]


>There are, except that the latency is rather high, and very high
>indeed if you use TCP/IP. That doesn't always matter, but sometimes
>does.

Nope. More accurately: mediating TCP/IP in a kernel between the NIC
and userspace will give higher latency than dedicated cluster
interconnects which can do RDMA or otherwise deliver data directly to
a specific application-space buffer and notify a specific
application-space thread.  There's no inherent reason why the
back-end protocol between such NICs can't be IP/TCP.  In fact there's
at least one vendor selling "VI" interconnects who does exactly that.

Jonathan Stone

unread,
Oct 31, 2003, 11:13:54 PM10/31/03
to
In article <BtBob.201566$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
[Posted and emailed.]

>Snipped another of Terje's examples of truly elegant code.
>
>Yes, but that sort of misses the point. The fact that you can "hide" the
>overhead of the checksum in other overhead, while a tribute to your
>abilities, ignores the fact that the additional overhead shouldn't be there
>at all.


I believe the overhead *should* be there, because the NIC vendors in
today's marketplace have demonstrably failed to implement IP/TCP
checksum offload 100% correctly, in all the niggly little corner cases
I've mentioned several times today already: Ethernet tinygrams,
fragmented UDP traffic, and so on.

By the way, I'm all *for* having TCP/IP checksum assist available.
But the sad truth is that the state of the art isn't up to where it
can be turned on, and software checksums turned off, by default.
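
For anyone who hasn't looked at it lately, the software checksum in question
is the RFC 1071 Internet checksum: a 16-bit one's-complement sum over the
segment.  In its unoptimized reference form it is roughly:

    #include <stdint.h>
    #include <stddef.h>

    /* RFC 1071 Internet checksum, unoptimized reference form: sum the data
     * as 16-bit words in one's-complement arithmetic and return the
     * complement.  Real stacks fold this into the copy loop and handle
     * alignment, byte order and the pseudo-header; this is just the shape. */
    uint16_t in_cksum(const void *data, size_t len)
    {
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {
            sum += *p++;
            len -= 2;
        }
        if (len == 1)                     /* odd trailing byte */
            sum += *(const uint8_t *)p;
        while (sum >> 16)                 /* fold carries back into 16 bits */
            sum = (sum & 0xFFFFu) + (sum >> 16);
        return (uint16_t)~sum;
    }

The per-byte cost is small, which is why leaving it on by default is a
tolerable insurance premium.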

Robert Myers

unread,
Oct 31, 2003, 11:25:54 PM10/31/03
to
On Fri, 31 Oct 2003 22:26:10 GMT, "Stephen Fuld"
<s.f...@PleaseRemove.att.net> wrote:

<snip>
>


>Don't get me started on iSCSI. :-(

Oh, but please do get started on iSCSI, or at least offer a link.

RedStorm is a perfect example: one specialized high-cost chip per
(almost) commodity CPU. Obscenely expensive switch avoided by not
using a switched network, at the expense of heaven only knows what
difficulties. I'll have a look at the T3E experience, but I have a
feeling I'm not going to be told about all the times it didn't work
and they just moved the problem onto a computer that didn't have such
obviously serious communication constraints.

For the truly reckless among us, and those who have the variety of
options for recovery available to RedStorm (i.e., the worst that a
really bad network screwup causes is the need to restart from a
checkpoint), there is hyperSCSI.

RM

Thomas Womack

unread,
Oct 31, 2003, 11:27:36 PM10/31/03
to
In article <bnuin8$i23$1...@pegasus.csx.cam.ac.uk>,
Nick Maclaren <nm...@cus.cam.ac.uk> wrote:

>Watch out. With both Fast and Gigabit Ethernet, the first cards and
>switches were dire - typically little faster than the best of the
>previous generation. It isn't impossible that a really cheap switch
>is one of the ones that manages to block badly even with 5 ports!

It wouldn't surprise me; this probably is an application where it makes
sense to buy the 150-pound switch rather than the 70-pound one. Though
I presume both are a single chip with five PHYs and a crossbar: 600MB/sec
doesn't feel like a tough thing to do on-chip in a technology that can
do the miraculous DSP for gigabit-over-copper.

>>I don't know whether the same will happen with 10GbE. Fibre doesn't
>>seem very civilised stuff to have random-users working with, and I
>>have trouble believing in 10Gb over copper. The 10Gb fibre interface
>>cards are four thousand pounds but reasonably widely available
>>(http://att.dealtime.com/xPO-Intel_PRO_10GbE_LR_Server_Adapter_PXLA8590LR);

>Fibre is a LOT more civilised than copper.

Ah. My immediate associations with fibre are "fragile, breaks if you
try to run it round too tight a corner" and "don't you have to polish
the ends and line them up to sub-micron tolerance under a microscope
to make connections"; are these in fact no longer problems? Is it
reasonable to run fibre across a living-room floor in the way that
every shared house of compsci students in the country does with CAT5?

The gigabit fibre NICs are not cheap, they seem to be UKP420 or so though
I have no idea if I'm looking at sensible suppliers -- the site with a
picture and decent specs was

http://www.midlandcomputers.co.uk/default.asp?area=prodview&prodid=2425.

I imagine it's not possible to make fibre patch-cables with a
thousand-foot reel of cable, a bag of ends, reasonable eyesight and a
crimping tool, though I'm pleased to see they're not ridiculously
expensive at somewhere like

http://www.cablemax.com/cablemax/fiber-optic-cables.cfm

Is networking with optic fibre a simple matter of substituting "CAT5
with RJ45 on each end" for "fibre pair with SC on each end", and the
RJ45 connector on the NIC for an SC one?

The fibre switches seem distinctly more expensive than the RJ45 ones,
http://www.lanshop.co.uk/gsm712f.htm is UKP1300 for eight fibre and
four copper gigabit ports, but again I'm not sure I'm not just making
a catastrophically bad choice of supplier.

Tom

Jonathan Stone

unread,
Oct 31, 2003, 11:33:18 PM10/31/03
to
In article <6kBob.22888$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:


>Absolutely. So I presume those errors occurred either in the interface from
>the NIC to the host memory or in the host software. If the first, it is
>poorly designed hardware and should be fixed. Even given PCI and its
>vagaries, there is little excuse for that sort of sloppiness.

It's a fact of life.

>The second should also be fixed.

But _today_ they *are not* fixed. So _today_, sane people[*] turn on
IP/TCP software checksums.  Is there any part of that we disagree about?

[*] or at least sane people who aren't trying to start flamewars by
denigrating entire research subfields and inventing false histories,
histories which go wholly against the grain of the documentary record,
in order to support their ahistorical views. Have you read yet what Bill
Plummer said about hardware checksum assist for TCP back in 1978?

>As I said previously, I have absolutely no objection
>to checking checksums in software as a debugging tool.  But let me point out
>again, that the same two sources of error would occur on a disk transfer
>over a SCSI interface card, but no one talks about software checksumming of
>disk I/O data.

I can't speak well to that, as I haven't spent years looking at SCSI
HBA errors.  I have spent years researching IP/TCP errors, and I have
spent much of the past couple of years writing device drivers to deal
with existing bugs in shipping, third- and fourth-generation NIC
IP/TCP offload. Oh yes, and reporting the error cases I discovered
back to the hardware vendors.

To repeat what I wrote earlier to another poster: most network
interface cards are a lot burstier and ill-behaved than nearly any SCSI
interface cards, and their DMA patterns are a whole lot messier.  (You
don't have to look at a bus analyzer trace to figure that out).

And the corner cases *inherent in the network protocols* --
fragmentation boundaries, different minimal packet lengths at
different layers, and so on -- are a whole lot nastier than anything
SCSI or even Fibre Channel ever has to deal with. It should hardly be
a surprise if NICs show bug patterns that SCSI or FC interfaces don't
show.

(That said: I've given talks on observed TCP errors to audiences with
a storage background, who will quietly and off-the-record admit that
once in a rare while, fibre channel will mess up and deliver the wrong
data to an HBA or to disk.)

Jonathan Stone

unread,
Oct 31, 2003, 11:39:46 PM10/31/03
to
In article <DtBob.201567$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

>I think there is somewhat of a feedback loop going on here. The NIC guys
>didn't put the effort into getting the NIC to host transfer right because
>they knew the software guys would catch their problems (and probably
>attribute them to network errors to boot!).

I disagree. Good networking software guys wouldn't ever do that,
because they'd be aware that link-level CRCs will catch a staggering
majority of true link-level network errors.

>Whereas, the people who
>developed SCSI adaptors knew that there would be no host checksum and that
>they would get blamed, so spent the effort to get it right.

If this was ever true, it is true no longer. It hasn't been true since
NIC vendors started shipping TCP/IP checksum assist and started
claiming that it would yield performance benefits. Alteon shipped
checksum offload on Tigon I and Tigon II what, five or six years ago now?

>OK, but again, why don't we do software checksums on SCSI disk I/O?

Answered ad nauseam elsewhere. Network protocols have nasty hairy
complicated corner cases. In SCSI, a disk block is a disk block is a
disk block, usually 512 bytes and (excluding audio CDs) a
power-of-two multiple of 512 bytes otherwise.

George William Herbert

unread,
Oct 31, 2003, 11:47:44 PM10/31/03
to
Jonathan Stone <jona...@Pescadero.DSG.Stanford.EDU> wrote:
>George William Herbert <gher...@gw.retro.com> wrote:
>>I objected specifically to forcing the CPU to do it rather
>>than assuming the network interface ASIC can do so properly.
>
>Point #1: The empirical evidence is that the NIC designers have a very
>poor track record of getting this completely right.  Why do you
>persist in assuming that the NIC ASICs get it right when the evidence
>is that, historically they *don't* get it 100% right?

I think you're missing the point.

In my experience, I have seen all of:
* NIC internal errors
* Link errors
* system bus errors
* Internal errors within RAM
* errors in CPUs
* errors in random I/O cards
* errors within the disk
* errors within RAID subsystems
* software errors leading to invalid DMAs
* ...

Arguments can be made for both improved hardware
and for CPU software checks (or, CPU dedicated check
hardware) for all of these cases.

The question is one of consistency, and one of balancing
the relative costs and benefits.

If the BER of a TCP link from the inside edge
of the sending network card to the inside edge of the
receiving network card is the same as that of a disk
subsystem to the same inside edge point, why would one
do TCP checksumming in the kernel but not embed checksums
in all disk blocks?

You're arguing that BERs on network links are higher
than commonly assumed, but I haven't seen evidence
posted that BERs on the boundary I defined are.

I have seen enough failures in the field and lab that
I know that all the types (and some I haven't mentioned,
and no doubt some I haven't thought of or heard of...)
will happen. You get bad parts; you get marginal parts;
you get defective designs, faulty QA, etc etc. People not
testing conditions well enough.

The design issue is, is the design balanced? Is the relative
effort taken to routinely handle each possible error vector
balanced with its proportional levels of occurrence and the
level of severity of it happening unnoticed?

Every single thing in a computer can be made redundant,
actively checksummed and corrected, cross checked, etc.
But the cost of doing so is high. Is the reliability
choice for TCP well balanced, or not?

I believe in TCP checksums. I am not convinced about
doing them in CPUs preferentially to in the ASICs.
Arguments that the ASICs are often faulty are refutable
by pointing to disks which are often faulty, memory which
is often faulty, etc.; we don't checksum those, in general,
inside the CPU.

>(An analogy to NASA achieving COTS is on the tip of my tongue.)
>
>Point #2: the empirical data shows fairly strongly that even if the
>NIC ASIC designers *were* to implement checksum offload 100%
>correctly, there are non-trivial sources of errors which occur in the
>path between the main memory and the NIC asic checksum engine. The
>NIC ASIC cannot protect against those errors: some (but not all)
>of those errors are in any case beyond their control.

Those errors will happen roughly just as often between other I/O
and main memory, main memory and the CPUs, etc.

And do.

What about TCP/IP connections makes them deserving of a higher
level of system bus internal crosschecking than main memory
contents, data going to/from disk, etc?

Let me describe this as "Creeping Stratusism".
Mechanisms to protect against any arbitrary failure
are known and can be implemented.  The question is,
have we documented sufficient reason to implement such
checks and protections on a given level of system?
If one follows the argument that a component can
fail, then any component can fail, and everything
must be redundant. You get a VOS box. Which most
people consider too expensive and painful and won't
shell out for. So, quantify the errors so that we
justify the reliability choice. Are there documentable
errors inside systems, between the I/O card and memory,
which are higher rate for TCP/IP network interfaces
than for other I/O or other memory operations?
Are network cards uniquely bad in terms of properly
handling TCP/IP checksums, compared to disks handling
disk data checksums and block retrieval and the like?

Such arguments got us ECC on system buses and in RAM.
They have not yet argued effectively for checksumming
embedded into data on disk. TCP/IP checksums in the
CPU are legacy from before I/O cards could do it;
is the current state of the art actually such that
the ASICs' checksumming is noticeably worse than disk
controllers or video controllers or RAM behaviour?


-george william herbert
gher...@retro.com

Bill Todd

unread,
Oct 31, 2003, 11:51:17 PM10/31/03
to

"Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
news:vWBob.22918$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net...

...

> I understand what end-to-end means. But if the error occurs in say the
> interface between the NIC and the host memory then why don't other cards
> with the same type of host interface (i.e. SCSI cards) have the same types
> of problems to the same degree. And if they do, why isn't there a big
> clammor for software end-to-end checksumming of disk I/O data.

As I said in my previous post on this subject, quite possibly because of the
dynamic nature of the relationship between decreasing hardware costs and
what the majority of people are willing to accept in terms of reliability.

The fact that you yourself just indicated that you included supplementary
CRC information in an array controller product you worked on is significant:
if *you* didn't find the round trip to disk and back to be sufficiently
reliable to make such added checks superfluous, then clearly similar
considerations would have applied to anyone accessing the same kinds of
disks without the benefit of an intervening (and duly paranoid) controller
to handle the problem for them - and yet, as you note, no common clamor for
placing such supplementary checks in software if the hardware doesn't
provide them seems to have (yet) surfaced.

The argument has already been advanced that the more complex nature of the
communication between a NIC and main memory than of that between a SCSI
controller and main memory may be part of the problem (though to me that
would suggest software/firmware issues rather than, say, hardware or
interface issues). Whatever the reason, the error rates seem to have been
demonstrated to be non-negligible (just as they are, in fact, for normal
disk access - though exactly *where* they creep in may not be clear).

- bill

Robert Myers

unread,
Nov 1, 2003, 12:03:43 AM11/1/03
to
On 31 Oct 2003 15:33:18 -0800, jona...@Pescadero.DSG.Stanford.EDU
(Jonathan Stone) wrote:

<snip>


>
>(That said: I've given talks on observed TCP errors to audiences with
>a storage background, who will quietly and off-the-record admint that
>once in a rare while, fibre channel will mess up and deliver the wrong
>data to an HBA or to disk.)

What are we going to do with all this computer power that's coming
down the road? Well, we'll just have to wait for Gates&Co to invent
some more buggy bloatware.

The idea that anything that has to do with my finances: my bank
account, my paycheck, my 401K, whatever, doesn't carry around some
guarantee of integrity that is checked every time the data is touched
just totally freaks me.

I would *rely* on a chain of hand-offs from an unknown series of
vendors with an unknown level of maintenance for any of these things?
You've got to be kidding.

I have the greatest of respect for hardware people, but there is
nothing quite like proof; e.g. the MD5 checksum is right, and at the
point of use. Why anybody would settle for less when so much is put
into "mission-critical" hardware is incomprehensible to me.

RM

Philip Armstrong

unread,
Oct 31, 2003, 11:19:04 PM10/31/03
to
In article <bnuoeb$em8$1...@Pescadero.DSG.Stanford.EDU>,

Jonathan Stone <jona...@Pescadero.DSG.Stanford.EDU> wrote:
>First time I heard of a "ping of death", i tried it against an
>up-to-date 4.4BSD-derived stack (NetBSD). As best I recall,
>I had no problem.

I believe the bug originated in the 4.3BSD stack (and earlier releases
perhaps?). Don't hold me to that claim though.

Jonathan Stone

unread,
Nov 1, 2003, 12:03:47 AM11/1/03
to
In article <aDBob.22900$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

>I tried to download this but got an invalid path. Is there a typo or what
>am i missing?

Odd, that's what I typed to my FTP client.  I sent you a copy via email.

>> Or you could ask someone who remembers having their files trashed in
>> the 1980s as a consequence of NFS relying on an outboard checksum, and
>> being hit by errors outside the coverage of that outboard checksum.
>
>I'm not sure of your point here. If you are saying that NFS had some
>problems, I won't disagree with that.

The point is not that NFS ``had problems''. The point is that Sun
Microsystems (and other contemporary vendors) performed an experiment
on their customers. By disabling the UDP checksum, they gave their
customers who used NFS over single-hop coaxial Ethernet a direct and
exact simulation of what'd happen if their customers had used NICs with
outboard UDP checksumming.  (Here the Ethernet CRC takes the place of
the hypothetical outboard UDP checksum.)  The consequences are still
remembered bitterly today.


>I tried to make it clear in an earlier post (and will repeat here) that I do
>not blame the TCP/IP designers given where we were when it was designed.

Stephen, I submit (with respect) that you have hostilely placed words
in their mouths, words which are clearly contradicted by both the
documentary record from the time, and by subsequent conversations with
"the TCP/IP designers" like Bob Braden.  In fact the exact history of
the TCP/IP checksum design was reconstructed and documented, in large
part in response to the paper by Craig Partridge and me cited earlier.

I'm at a loss for words as to why you are doing that. I can only
repeat that it's not going to go over well with anyone with a networking
background. Indeed, "crackpot" is not far off the mark.


>Fibre channel is pretty successful in its area though it is certainly not
>used for general networking. The issues with IB's commercial success or

>lack thereof have been discussed here before several times. [...]

I haven't seen it here, not by people with a good networking
background, not from the perspective of Infiniband as any kind of
general-purpose transport.

Jonathan Stone

unread,
Nov 1, 2003, 12:14:21 AM11/1/03
to
In article <Q%zob.8203$5y7...@news.cpqcorp.net>,
Rick Jones <f...@bar.baz.invalid> wrote:

>That still doesn't mean I'm uncomfortable enough with CKO to say to
>toss it - I have at least as much faith in a CKO NIC as I do in say,
>a RAID controller... :)

Before enabling checksum offload and disabling software checksums,
for any data you care about -- experimental data, dissertation,
data-center, financial records, that kind of stuff- I would recommend
you check very carefully whether your NIC gets checksum offload
correct in at least the following circumstances:

1. In 802.1Q-tagged VLAN packets, if you are using VLAN tagging.

2. For transmitted IP/TCP packets shorter than the 60-byte Ethernet minimum.

3. For received IP/TCP packets shorter than the 60-byte Ethernet minimum.

4. For outbound fragmented UDP traffic, if you are using NFS over UDP
or other fragmented UDP traffic.

5. For inbound fragmented UDP, as above.

6. For a broad spectrum of normal Ethernet payload sizes, including
odd-size packets.
7. If using jumbo frames, repeat for jumbo-frame packets.


8. Carefully check the NIC vendor errata and the PCI (or other bus)
host-bridge (or bus-bridge, if used) vendor errata for any deviations
from the spec which either chip can trigger in the other.
The shorter and non-aligned DMA requests from NICs have been known
to trigger such errors when SCSI or FC adaptors did not.

I just loved not being able to SSH into machines in the lab when the
vendor changed the chip stepping under the device driver which (at
that point) had decided that checksum offload was "safe"...
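
If I were writing the policy myself, it would look something like the sketch
below -- purely illustrative, with made-up type and field names rather than
any real driver's API -- a per-packet decision that trusts the offload engine
only outside the known-bad corner cases listed above:

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical types, for illustration only -- not a real driver API. */
    struct pkt {
        size_t len;
        bool   vlan_tagged;
        bool   udp_fragment;
        bool   jumbo;
    };

    struct nic_quirks {            /* filled in from the vendor errata */
        bool broken_vlan_csum;
        bool broken_runt_csum;
        bool broken_udp_frag_csum;
        bool broken_jumbo_csum;
    };

    /* Return true if the NIC's checksum offload can be trusted for this
     * packet; otherwise the stack falls back to the software checksum. */
    static bool use_hw_csum(const struct pkt *p, const struct nic_quirks *q)
    {
        if (p->vlan_tagged && q->broken_vlan_csum)
            return false;                    /* case 1 above */
        if (p->len < 60 && q->broken_runt_csum)
            return false;                    /* cases 2 and 3 */
        if (p->udp_fragment && q->broken_udp_frag_csum)
            return false;                    /* cases 4 and 5 */
        if (p->jumbo && q->broken_jumbo_csum)
            return false;                    /* case 7 */
        return true;
    }

Of course a table like that is only as good as the errata behind it, which is
rather the point.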

Jonathan Stone

unread,
Nov 1, 2003, 12:19:30 AM11/1/03
to
In article <UCyob.8196$ks7....@news.cpqcorp.net>,
Rick Jones <f...@bar.baz.invalid> wrote:

>The "tensions" between the need/desire for a "zero-copy" solutuion and
>strict adherence to memory-to-memory end-to-end in networking are of
>course, a very firtile ground for discussions :)

Sure.  It's no accident that the RDMA "network" hardware built at
Stanford DSG around the time you took CS244 was actually moving
cachelines from one CPU module to another or to switches.  The follow-on
PCI board had to settle for PCI DMA. I look forward to playing with
gig-e chips that can spit Ethernet frames to and from FIFOs on
MIPS-style CPU cores, which parse the headers and deliver frames to
appropriate addresses on a HyperTransport link.

Except I don't believe the MIPS cores can keep up with 10GbE yet...

>Now, if only those pesky hardware guys could get memory keeping up
>with processors maybe we wouldn't have to worry about it anymore... :)

It's cyclic: Ethernet speed goes up in factors of 10 every few years
whereas memory bandwidth goes up in quicker, shorter steps.

Jonathan Stone

unread,
Nov 1, 2003, 12:36:30 AM11/1/03
to
In article <irKdnYBhFYD...@metrocast.net>,
Bill Todd <bill...@metrocast.net> wrote:


>The argument has already been advanced that the more complex nature of the
>communication between a NIC and main memory than of that between a SCSI
>controller and main memory may be part of the problem (though to me that
>would suggest software/firmware issues rather than, say, hardware or
>interface issues).

Sure.  But from my own perspective, metaphorically sitting firmly in
the TCP/IP stack, that doesn't matter one iota.  A software checksum
would catch *all* those errors (whether the error is in the device
driver software, NIC firmware, or an out-and-out bus interface bug).

In contrast, a NIC outboard implementation of the IP/TCP/UDP checksum
isn't ever going to catch errors which happen between itself and the
TCP buffer in main memory.

>Whatever the reason, the error rates seem to have been
>demonstrated to be non-negligible (just as they are, in fact, for normal
>disk access - though exactly *where* they creep in may not be clear).

Sometimes it is clear, sometimes it isn't.

Jonathan Stone

unread,
Nov 1, 2003, 12:31:00 AM11/1/03
to
In article <vWBob.22918$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

Stephen,

I say again: you are making unfair and unwarranted assumptions about
the degree of insight on the part of "TCP/IP designers", what they
designed at the time (and why they did so) and what they'd say now. I
can say that with very strong assurance, as (as part of the
introduction to my doctoral dissertation) I researched and wrote up
much of the historical design process for the TCP/IP checksum: both from
the documentary record and by asking the individuals concerned for
their recollections and (where possible) personal notes and email.


>I understand what end-to-end means. But if the error occurs in say the
>interface between the NIC and the host memory then why don't other cards
>with the same type of host interface (i.e. SCSI cards) have the same types
>of problems to the same degree. And if they do, why isn't there a big
>clammor for software end-to-end checksumming of disk I/O data.

I ask again: How many SCSI cards can you name that do DMA to
odd-byte-length buffers?  Or that are deliberately programmed to DMA
data into non-32-bit-aligned buffers? Or that have to deal with
fragmentation and reassembly across non-power-of-two boundaries, on
both sides of the layer boundary?  I don't know of any, do you?

Your argument is by analogy, and the analogy just doesn't stand up to more
than cursory examination.


>> Nevermind the well-known demonstrated fact that hardware weenies
>> simply cannot get all the corner cases of TCP checksum offload right,
>> at least not in the mass-market cards available on today's market.
>
>Given that they can get the hardware checksumming right on other protocols,
>does that say more about the hardware designers or the protocol? BTW, I am
>not a hardware guy but your calling them "weenies" "seems rather like
>contempt" for them :-)

It was a conscious echo of a derogatory tone toward "networking
people" which I noted in your messages. (well, okay, that combined
with your blatantly incorrect assumptions about the historical
record). If that tone wasn't intended, then I apologise, both to you
and to all the hardware guys in comp.arch.

Jonathan Stone

unread,
Nov 1, 2003, 1:01:19 AM11/1/03
to
In article <bnusb0$5cd$1...@gw.retro.com>,

George William Herbert <gher...@gw.retro.com> wrote:


>I think you're missing the point.

Maybe we're talking past each other. You seem to be missing my point,
as you demonstrate below:


>You're arguing that BERs on network links are higher
>than commonly assumed,

I'm not arguing anything, I'm alluding to data published in the
peer-reviewed literature. And I'm not saying *anything* at all about
errors on network links: the errors I looked at had survived the
Ethernet CRC.  (BTW, I have read the original design docs for that CRC).

The kinds of errors I'm talking about happened due to bad bits in
router or switch buffers (by correlating single-bit errors in multiple
TCP flows, I pinpointed a likely persistent bad bit at a particular
byte offset in one Berkeley router).  Or, errors which occurred within
the sending TCP/IP host, after software computed the TCP checksum but
before the packet hit the outbound Ethernet CRC hardware.

Some of those errors were pinpointed on an old non-switched thinwire
Ethernet, so the likelihood that a switch or router caused them is
vanishingly small.

>I believe in TCP checksums. I am not convinced about
>doing them in CPUs preferentially to in the ASICs.

>Those errors will happen roughly just as often between other I/O
>and main memory, main memory and the CPUs, etc.

Nope.  Empirically, if a particular TCP packet triggers a NIC checksum
error, the same packet (or retransmissions of the packet, or, in a
really bad case, any equal-length packet) will be wrong *every time*.

If you want a full appreciation of the issue, install an open-source
OS on a server, install one of the NICs with buggy checksum offload,
edit the device driver to re-enable checksum offload, rebuild a kernel
and reboot the server. If you pick the right [wrong] card, you've got
maybe an hour before you trigger the bug. At that point *every*
outbound IP/TCP ACK-only segment from the server will have a bad
checksum. Even rebooting the server won't help; you have to cycle A/C
power to restore the NIC to a working state.  And (given sufficient
insight and tools) one can trigger the problem again in minutes.

[... long snip about Stratusism]


>is the current state of the art actually such that
>the ASICs checksumming are noticably worse than disk
>controllers or video controllers or RAM behaviour?

Yes.  Demonstrably, and by several orders of magnitude, in a variety
of different corner cases.  I haven't yet found a vendor I'm 100% happy
with.

Jonathan Stone

unread,
Nov 1, 2003, 1:32:27 AM11/1/03
to
In article <vWBob.22918$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
...

To clarify why I took such exception to Stephen Fuld's message about
the TCP/IP designers:

In point of fact IEN-45 (W. W. Plummer, 1978) explicitly addresses the
idea of a TCP offload engine (ToE) and devotes section 5 to a discussion
of ToEs and the IP/TCP checksum -- though for historical reasons,
the discussion is couched in terms of a front-end minicomputer implementing
TCP, and sending TCP letters to and from a host (DEC-10 era mainframe).

To quote IEN-45 section 5:

The problem with attempting to validate checksums in the frontend
is that it destroys the end-to-end character of the checksum. If
anything, this is the most powerful feature of the TCP checksum!

IEN-45 goes on to sketch a double-DMA protocol, where the ToE
(frontend) would DMA verified TCP payloads to the end-host and keep
its copy of the data; then ask the host to DMA the TCP payload back to
the frontend, whereupon the frontend would compare what it saved to
what the host returned.
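
In rough C (my paraphrase of the scheme, not IEN-45's own pseudocode, and
with the DMA steps stubbed out with memcpy so only the data flow shows), the
idea is:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    enum { MAXPDU = 65536 };

    /* Hypothetical front-end state; 'scratch' holds the read-back copy. */
    struct frontend { uint8_t scratch[MAXPDU]; };

    /* Stand-ins for the two DMA operations, purely so the shape is visible. */
    static void dma_to_host(uint8_t *host_buf, const uint8_t *src, size_t len)
    {
        memcpy(host_buf, src, len);
    }

    static void dma_from_host(uint8_t *dst, const uint8_t *host_buf, size_t len)
    {
        memcpy(dst, host_buf, len);
    }

    /* Deliver a verified TCP payload, read it back, and compare: an error on
     * the path into host memory shows up as a mismatch. */
    int deliver_and_verify(struct frontend *fe, const uint8_t *payload,
                           size_t len, uint8_t *host_buf)
    {
        dma_to_host(host_buf, payload, len);
        dma_from_host(fe->scratch, host_buf, len);
        if (memcmp(payload, fe->scratch, len) != 0)
            return -1;          /* corrupted in host memory: retransmit */
        return 0;
    }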

Compare that with Stephen's unsupported assertions about what the
TCP/IP designers thought, and draw your own conclusions.

Stephen Fuld

unread,
Nov 1, 2003, 8:00:37 AM11/1/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnurfu$f01$1...@Pescadero.DSG.Stanford.EDU...

> In article <6kBob.22888$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:
>
>
> >Absolutely.  So I presume those errors occurred either in the interface from
> >the NIC to the host memory or in the host software.  If the first, it is
> >poorly designed hardware and should be fixed.  Even given PCI and its
> >vagaries, there is little excuse for that sort of sloppiness.
>
> Its a fact of life.

But it is an interesting issue to look at why NICs seem subject to these
kinds of errors much more so than say SCSI cards that go to the same kind of
PCI bus.  You have offered two reasons:

1. Burstiness of transfers.  I don't think this is an issue as the
burstiness of the network (collisions etc.) is hidden from the host side of
the NIC and by the time you get to the actual transfer across the interface,
I don't think it is any "burstier" than disk traffic. Think 15 ms with
nothing (seek plus latency) followed by say 64K (pick your favorite transfer
size) in a burst.

2. Various and changing DMA sizes. A real hardware guy would be better
at answering this than me, but again, I don't see why, by the time you get
to the host side of the card, that should be a problem.

Nevertheless, such problems apparently do occur with more frequency on
NICs than SCSI cards, so there must be an explanation. My proposed
explanation is (as I stated in another post), that the people who designed
the NICs "knew" that the checksum would be done in the software and any
checksum errors were unlikely to be blamed on them anyway, so didn't take
the care needed to get this right as much as the SCSI card designers did.

> >The second should also be fixed.
>
> But _today_ they *are not* fixed. So _today_, sane people[*] turn on
> IP/TCP software checksums.  Is there any part of that we disagree about?

Only theoretically. I agree that with things the way they are today,
software checksumming is important, and the extra overhead can be hidden (as
Terje has shown) in other areas. My argument is that if the protocol were
designed to be more hardware friendly, that overhead wouldn't be there to
hide behind and the incorporation of things like checksumming in hardware
would be a big enough savings that people would fix their cards and their
software such that things would be better. And by the way we would have
world peace and freedom from hunger too. :-)
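
To make the "hidden in other areas" point concrete, here is a minimal sketch
(mine, not Terje's code) of the usual trick: fold the one's-complement sum
into the copy the stack has to do anyway.  It assumes aligned buffers and an
even length for brevity:

    #include <stdint.h>
    #include <stddef.h>

    /* Copy len bytes and accumulate the Internet-checksum partial sum as a
     * side effect.  Simplified: assumes 16-bit-aligned buffers and even len;
     * a real stack also handles the odd byte, misalignment and the
     * pseudo-header.  Since the data is being touched for the copy anyway,
     * the adds mostly disappear into the memory traffic. */
    uint32_t copy_and_sum(uint16_t *dst, const uint16_t *src, size_t len)
    {
        uint32_t sum = 0;

        for (size_t i = 0; i < len / 2; i++) {
            dst[i] = src[i];
            sum += src[i];
        }
        while (sum >> 16)
            sum = (sum & 0xFFFFu) + (sum >> 16);
        return sum;    /* caller complements after adding the pseudo-header */
    }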

> [*] or at least sane people who aren't trying to start flamewars by
> denigrating entire research subfields and inventing false histories,
> histories which go wholly against the grain of the documentary record,
> in order to support their ahistorical views. Have you read yet what Bill
> Plummer said about hardware checksum assist for TCP back in 1978?

I have repeated more times than I should have that I was not denigrating
entire research fields, insulting people, etc.  Yet you seem to persist in
thinking I am doing so and even insult me.  As for the article, I tried to
download it and couldn't, and noted that in a post.  Thank you for sending me
(by private e-mail) the information.  I haven't yet read it, but will do so,
probably tomorrow.

> >As I said previously, I have absolutely no objection
> >to checking checksums in software as a debugging tool.  But let me point out
> >again, that the same two sources of error would occur on a disk transfer
> >over a SCSI interface card, but no one talks about software checksumming of
> >disk I/O data.
>
> I can't speak well to that, as I haven't spent years looking at SCSI
> HBA errors.  I have spent years researching IP/TCP errors, and I have
> spent much of the past couple of years writing device drivers to deal
> with existing bugs in shipping, third- and fourth-generation NIC
> IP/TCP offload. Oh yes, and reporting the error cases I discovered
> back to the hardware vendors.

But do you at least agree that it is an interesting question to ask, and
pursuing it might lead to some insight which could help enlighten people?
At this point, that is all I ask.

> To repeat what I wrote earlier to another poster: most network
> interface cards are a lot burstier and ill-behaved than nearly any SCSI
> interface cards, and their DMA patterns are a whole lot messier.  (You
> don't have to look at a bus analyzer trace to figure that out).

See my comments on that above.

> And the corner cases *inherent in the network protocols* --
> fragmentation boundaries, different minimal packet lengths at
> different layers, and so on -- are a whole lot nastier than anything
> SCSI or even Fibre Channel ever has to deal with. It should hardly be
> a surprise if NICs show bug patterns that SCSI or FC interfaces don't
> show.

Yes, I agree with that.  But it isn't clear to me why that should affect the
reliability of the NIC-to-host transfer hardware.

> (That said: I've given talks on observed TCP errors to audiences with
> a storage background, who will quietly and off-the-record admit that
> once in a rare while, fibre channel will mess up and deliver the wrong
> data to an HBA or to disk.)

Sure, it happens occasionally. As Del said, all hardware (and software) has
bugs and breaks occasionally. But the issue remains as to why NICs seem
much more "error prone". I still haven't seen a satisfactory answer to
that.

Stephen Fuld

unread,
Nov 1, 2003, 8:10:44 AM11/1/03
to

"Robert Myers" <rmy...@rustuck.com> wrote in message
news:dmt5qv0qe0j2bt7lj...@4ax.com...

snip

> I have the greatest of respect for hardware people, but there is
> nothing quite like proof; e.g. the MD5 checksum is right, and at the
> point of use. Why anybody would settle for less when so much is put
> into "mission-critical" hardware is incomprehensible to me.

But you are concentrating on only part of the problem. Unless you are using
a mainframe (IBM or Unisys, or perhaps some others) or a duplexed system
like Tandem, you pretty much have no assurance that when you execute an
instruction to add one to a register, it really didn't add two. (Of course
this is merely one example.) Now say that error occurs once every few
billion or so instructions (i.e. once an hour or so). That is one reason
why most large financial institutions (banks, insurance companies,
brokerages, etc.) keep customer records on mainframes.

Stephen Fuld

unread,
Nov 1, 2003, 8:20:51 AM11/1/03
to

"Jonathan Stone" <jona...@Pescadero.DSG.Stanford.EDU> wrote in message
news:bnurs2$f17$1...@Pescadero.DSG.Stanford.EDU...

> In article <DtBob.201567$0v4.15...@bgtnsc04-news.ops.worldnet.att.net>,
> Stephen Fuld <s.f...@PleaseRemove.att.net> wrote:

snip

> >Whereas, the people who
> >developed SCSI adaptors knew that there would be no host checksum and that
> >they would get blamed, so spent the effort to get it right.
>
> If this was ever true, it is true no longer. It hasn't been true since
> NIC vendors started shipping TCP/IP checksum assist and started
> claiming that it would yield performance benefits. Alteon shipped
> checksum offload on Tigon I and Tigon II what, five or six years ago now?

Your knowledge in that area far exceeds mine. But have NICs with checksums
been in common use that long? I guess I knew they were available back then,
but I didn't think they were common.


>
> >OK, but again, why don't we do software checksums on SCSI disk I/O?
>
> Answered ad nauseam elsewhere. Network protocols have nasty hairy
> complicated corner cases. In SCSI, a disk block is a disk block is a
> disk block, usually 512 bytes and (excluding audio CDs) a
> power-of-two multiple of 512 bytes otherwise.

While frequently true, it is untrue often enough to make your statement
wrong.  For example, IBM AS/400s (i series??) use 520 bytes.  Various models
of Unisys mainframe systems use 180 and 504 byte sectors. Many high end
storage controllers use other sizes since they add additional information,
either control information or internal checksums to the basic data block.

Stephen Fuld

unread,
Nov 1, 2003, 8:41:03 AM11/1/03
to

"Bill Todd" <bill...@metrocast.net> wrote in message
news:irKdnYBhFYD...@metrocast.net...

>
> "Stephen Fuld" <s.f...@PleaseRemove.att.net> wrote in message
> news:vWBob.22918$Ec1.2...@bgtnsc05-news.ops.worldnet.att.net...
>
> ...
>
> > I understand what end-to-end means. But if the error occurs in say the
> > interface between the NIC and the host memory then why don't other cards
> > with the same type of host interface (i.e. SCSI cards) have the same types
> > of problems to the same degree?  And if they do, why isn't there a big
> > clamor for software end-to-end checksumming of disk I/O data?
>
> As I said in my previous post on this subject, quite possibly because of the
> dynamic nature of the relationship between decreasing hardware costs and
> what the majority of people are willing to accept in terms of reliability.

I think that is true at one level, but going deeper, we get to the question
of why the card-to-host transfer is inherently less reliable in a NIC than
in a SCSI card.  I know that Jonathan has posited two reasons for this, and
I have responded that I don't think they are right.  Unfortunately the nature
of Usenet means that we are frequently responding to multiple sub-threads
and have posts "crossing in the night", so that is still getting worked out.

> The fact that you yourself just indicated that you included supplementary
> CRC information in an array controller product you worked on is significant:
> if *you* didn't find the round trip to disk and back to be sufficiently
> reliable to make such added checks superfluous, then clearly similar
> considerations would have applied to anyone accessing the same kinds of
> disks without the benefit of an intervening (and duly paranoid) controller
> to handle the problem for them - and yet, as you note, no common clamor for
> placing such supplementary checks in software if the hardware doesn't
> provide them seems to have (yet) surfaced.

The system I was talking about was for attachment to an IBM mainframe-type
system. They tend to have "tighter" requirements than most other such
systems. AFAIK, all controllers for IBM mainframes do something like what
we did, even using the same disks as are used without such features on
typical high end servers.

> The argument has already been advanced that the more complex nature of the
> communication between a NIC and main memory than of that between a SCSI
> controller and main memory may be part of the problem (though to me that
> would suggest software/firmware issues rather than, say, hardware or
> interface issues).

Yes.  And in a simple NIC, I don't know how much firmware there is on the card,
but I guess not much, if any. There is still something I don't get here.

> Whatever the reason, the error rates seem to have been
> demonstrated to be non-negligible

Agreed.

> (just as they are, in fact, for normal
> disk access - though exactly *where* they creep in may not be clear).

I think the point that Jonathan was making is that they are greater for NICs
than for disk accesses. I will accept that as it seems to be a widely held
belief. Assuming it is true, I still don't understand why.
