BBoard-ID: 7621
BB-Posted: Tue, 25 Oct 88 2:06:08 EDT
To: tcp...@sri-nic.ARPA
Subject: 4BSD TCP Ethernet Throughput
Date: Mon, 24 Oct 88 13:33:13 PDT
From: Van Jacobson <v...@helios.ee.lbl.gov>
Many people have asked for the Ethernet throughput data I
showed at Interop so it's probably easier to post it:
These are some throughput results for an experimental version of
the 4BSD (Berkeley Unix) network code running on a couple of
different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD
LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel
82586 Ethernet chip) [note again the tests were done with Sun
hardware but not Sun software -- I'm running 4.?BSD, not Sun
OS]. There are lots and lots of interesting things in the data
but the one thing that seems to have attracted people's
attention is the big difference in performance between the two
Ethernet chips.
The test measured task-to-task data throughput over a TCP
connection from a source (e.g., chargen) to a sink (e.g.,
discard). The tests were done between 2am and 6am on a fairly
quiet Ethernet (~100Kb/s average background traffic). The
packets were all maximum size (1538 bytes on the wire or 1460
bytes of user data per packet). The free parameters for the
tests were the sender and receiver socket buffer sizes (which
control the amount of 'pipelining' possible between the sender,
wire and receiver). Each buffer size was independently varied
from 1 to 17 packets in 1 packet steps. Four tests were done at
each of the 289 combinations. Each test transferred 8MB of data
then recorded the total time for the transfer and the send and
receive socket buffer sizes (8MB was chosen so that the worst
case error due to the system clock resolution was ~.1% -- 10ms
in 10sec). The 1,156 tests per machine pair were done in random
order to prevent any effects from fixed patterns of resource
allocation.
In general, the maximum throughput was observed when the sender
buffer equaled the receiver buffer (the reason why is complicated
but has to do with collisions). The following table gives the
task-to-task data throughput (in KBytes/sec) and throughput on
the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and
(b) a 3/280 sending to a 3/60.
 ________________________________________________
|        3/60 to 3/60         |  3/280 to 3/60   |
|      (LANCE to LANCE)       | (Intel to LANCE) |
| socket                      |                  |
| buffer    task to           | task to          |
|  size    task    wire       | task    wire     |
|(packets) (KB/s)  (Mb/s)     | (KB/s)  (Mb/s)   |
|    1      384     3.4       |  337     3.0     |
|    2      606     5.4       |  575     5.1     |
|    3      690     6.1       |  595     5.3     |
|    4      784     6.9       |  709     6.3     |
|    5      866     7.7       |  712     6.3     |
|    6      904     8.0       |  708     6.3     |
|    7      946     8.4       |  710     6.3     |
|    8      954     8.4       |  718     6.4     |
|    9      974     8.6       |  715     6.3     |
|   10      983     8.7       |  712     6.3     |
|   11      995     8.8       |  714     6.3     |
|   12     1001     8.9       |  715     6.3     |
|_____________________________|__________________|
The theoretical maximum data throughput, after you take into
account all the protocol overheads, is 1,104 KB/s (this
task-to-task data rate would put 10Mb/s on the wire). You can
see that the 3/60s get 91% of the theoretical max. The
3/280, although a much faster processor (the CPU performance is
really dominated by the speed of the memory system, not the
processor clock rate, and the memory system in the 3/280 is
almost twice the speed of the 3/60), gets only 65% of
theoretical max.
The low throughput of the 3/280 seems to be entirely due to the
Intel Ethernet chip: at around 6Mb/s, it saturates. (I put the
board on an extender and watched the bus handshake lines on the
82586 to see if the chip or the Sun interface logic was pooping
out. It was the chip -- it just stopped asking for data. The
CPU was loafing along with at least 35% idle time during all
these tests so it wasn't the limit.)
[Just so you don't get confused: Stuff above was measurements.
Stuff below includes opinions and interpretation and should
be viewed with appropriate suspicion.]
If you graph the above, you'll see a large notch in the Intel
data at 3 packets. This is probably a clue to why it's dying:
TCP delivers one ack for every two data packets. At a buffer
size of three packets, the collision rate increases dramatically
since the sender's third packet will collide with the receiver's
ack for the previous two packets (for buffer sizes of 1 and 2,
there are effectively no collisions). My suspicion is that the
Intel is taking a long time to recover from collisions (remember
that you're 64 bytes into the packet when you find out you've
collided so the chip bus logic has to back up 64 bytes -- Intel
spent their silicon making the chip "programmable", I doubt they
invested as much as AMD in the bus interface). This may or may
not be what's going on: life is too short to spend debugging
Intel parts so I really don't care to investigate further.
The one annoyance in all this is that Sun puts the fast Ethernet
chip (the AMD LANCE) in their slow machines (3/50s and 3/60s)
and the slow Ethernet chip (Intel 82586) in their fast machines
(3/180s, 3/280s and Sun-4s, i.e., all their file servers).
[I've had to put delay loops in the Ethernet driver on the 3/50s
and 3/60s to slow them down enough for the 3/280 server to keep
up.] Sun's not to blame for anything here: It costs a lot
to design a new Ethernet interface; they had a design for the
3/180 board set (which was the basis of all the other VME
machines--the [34]/280 and [34]/110); and no market pressure to
change it. If they hadn't ventured out in a new direction with
the 3/[56]0 -- the LANCE -- I probably would have thought
700KB/s was great Ethernet throughput (at least until I saw
Dave Boggs' DEC-Titan/Seeq-chip throughput data).
But I think Sun is overdue in offering a high-performance VME
Ethernet interface. That may change though -- VME controllers
like the Interphase 4207 Eagle are starting to appear which
should either put pressure on Sun and/or offer a high
performance 3rd party alternative (I haven't actually tried an
Eagle yet but from the documentation it looks like they did a
lot of things right). I'd sure like to take the delay loops out
of my LANCE driver...
- Van
ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as
the Intel-to-LANCE I listed above. Using an Intel chip on the
receiver, the results are MUCH worse -- 420KB/s max. I chose
the data that put the 82586 in its very best light.
I also have scope pictures taken at the transceivers during all
these tests. I'm sure there'll be a chorus of "so-and-so violates
the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR
SYSTEMS VIOLATED THE ETHERNET SPEC IN ANY WAY, SHAPE OR FORM.
I looked very carefully for violations and have the pictures to
prove there were none.
Finally, all of the above is Copyright (c) 1988 by Van Jacobson.
If you want to reproduce any part of it in print, you damn well
better ask me first -- I'm getting tired of being misquoted in
trade rags.
::::::::::::::
Received: from venera.isi.edu by NNSC.NSF.NET id aa06169; 30 Apr 90 4:43 EDT
Posted-Date: Mon, 30 Apr 90 01:40:59 PDT
Received: from vs.ee.lbl.gov by venera.isi.edu (5.61/5.61+local)
id <AA05073>; Mon, 30 Apr 90 01:39:43 -0700
Received: by helios.ee.lbl.gov (5.61/1.39)
id AA26743; Mon, 30 Apr 90 01:40:59 -0700
Message-Id: <900430084...@helios.ee.lbl.gov>
To: end2end-...@venera.isi.edu
Subject: modified TCP congestion avoidance algorithm
Date: Mon, 30 Apr 90 01:40:59 PDT
From: Van Jacobson <v...@helios.ee.lbl.gov>
This is a description of the modified TCP congestion avoidance
algorithm that I promised at the teleconference.
BTW, on re-reading, I noticed there were several errors in
Lixia's note besides the problem I noted at the teleconference.
I don't know whether that's because I mis-communicated the
algorithm at dinner (as I recall, I'd had some wine) or because
she's convinced that TCP is ultimately irrelevant :). Either
way, you will probably be disappointed if you experiment with
what's in that note.
First, I should point out once again that there are two
completely independent window adjustment algorithms running in
the sender: Slow-start is run when the pipe is empty (i.e.,
when first starting or re-starting after a timeout). Its goal
is to get the "ack clock" started so packets will be metered
into the network at a reasonable rate. The other algorithm,
congestion avoidance, is run any time *but* when (re-)starting
and is responsible for estimating the (dynamically varying)
pipesize. You will cause yourself, or me, no end of confusion
if you lump these separate algorithms together (as Lixia's
message did).
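As a reminder of the distinction, the two per-ack window rules
look roughly like this (an illustrative sketch in the style of the
4.3BSD code; the names cwnd/ssthresh/mss follow the kernel, units
are bytes, and this is a model, not the actual tcp_input routine):

```c
#include <assert.h>

/* Sketch of the two independent window-adjustment rules, applied
 * once per new ack.  Which rule runs depends only on whether cwnd
 * is below the slow-start threshold. */
static unsigned long open_window(unsigned long cwnd,
                                 unsigned long ssthresh,
                                 unsigned long mss)
{
    if (cwnd < ssthresh)
        cwnd += mss;               /* slow-start: one mss per ack,
                                      i.e. the window doubles each RTT */
    else
        cwnd += mss * mss / cwnd;  /* congestion avoidance: roughly
                                      one mss of growth per RTT */
    return cwnd;
}
```

With mss = 512 and ssthresh = 4096, a 512-byte cwnd doubles each
round trip until it crosses ssthresh, then creeps up linearly.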
The modifications described here are only to the congestion
avoidance algorithm, not to slow-start, and they are intended to
apply to large bandwidth-delay product paths (though they don't
do any harm on other paths). Remember that with regular TCP (or
with slow-start/c-a TCP), throughput really starts to go to hell
when the probability of packet loss is on the order of the
bandwidth-delay product. E.g., you might expect a 1% packet
loss rate to translate into a 1% lower throughput but for, say,
a TCP connection with a 100 packet b-d p. (= window), it results
in a 50-75% throughput loss. To make TCP effective on fat
pipes, it would be nice if throughput degraded only as function
of loss probability rather than as the product of the loss
probability and the b-d p. (Assuming, of course, that we can do
this without sacrificing congestion avoidance.)
These mods do two things: (1) prevent the pipe from going empty
after a loss (if the pipe doesn't go empty, you won't have to
waste round-trip times re-filling it) and (2) correctly account
for the amount of data actually in the pipe (since that's what
congestion avoidance is supposed to be estimating and adapting to).
For (1), remember that we use a packet loss as a signal that the
pipe is overfull (congested) and that packet loss can be
detected one of two different ways: (a) via a retransmit
timeout or (b) when some small number (3-4) of consecutive
duplicate acks has been received (the "fast retransmit"
algorithm). In case (a), the pipe is guaranteed to be empty so
we must slow-start. In case (b), if the duplicate ack
threshold is small compared to the bandwidth-delay product, we
will detect the loss with the pipe almost full. I.e., given a
threshold of 3 packets and an LBL-MIT bandwidth-delay of around
24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75%
full when fast-retransmit detects a loss (actually, until
gateways start doing some sort of congestion control, the pipe
is overfull when the loss is detected so *at least* 75% of the
packets needed for ack clocking are in transit when
fast-retransmit happens). Since the pipe is full, there's no
need to slow-start after a fast-retransmit.
For (2), consider what a duplicate ack means: either the
network duplicated a packet (i.e., the NSFNet braindead IBM
token ring adapters) or the receiver got an out-of-order packet.
The usual cause of out-of-order packets at the receiver is a
missing packet. I.e., if there are W packets in transit and one
is dropped, the receiver will get W-1 out-of-order and
(4.3-tahoe TCP will) generate W-1 duplicate acks. If the
`consecutive duplicates' threshold is set high enough, we can
reasonably assume that duplicate acks mean dropped packets.
But there's more information in the ack: The receiver can only
generate one in response to a packet arrival. I.e., a duplicate
ack means that a packet has left the network (it is now cached
at the receiver). If the sender is limited by the congestion
window, a packet can now be sent. (The congestion window is a
count of how many packets will fit in the pipe. The ack says a
packet has left the pipe so a new one can be added to take its
place.) To put this another way, say the current congestion
window is C (i.e., C packets will fit in the pipe) and D
duplicate acks have been received. Then only C-D packets are
actually in the pipe and the sender wants to use a window of C+D
packets to fill the pipe to its estimated capacity (C+D sent -
D received = C in pipe).
So, conceptually, the slow-start/cong.avoid/fast-rexmit changes
are:
- The sender's input routine is changed to set `cwnd' to `ssthresh'
when the dup ack threshold is reached. [It used to set cwnd to
mss to force a slow-start.] Everything else stays the same.
- The sender's output routine is changed to use an effective window
of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition
of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit
threshold is reached, and is non-zero only while a sequence of
duplicate acks is being received.
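A toy model of those two changes (the field names follow the BSD
pcb, but the struct and routines here are illustrative, not the
real tcp_input/tcp_output):

```c
#include <assert.h>

/* Illustrative stand-in for the handful of tcpcb fields involved. */
struct mini_tcpcb {
    unsigned long snd_wnd;   /* receiver's offered window */
    unsigned long cwnd;      /* congestion window */
    unsigned long ssthresh;  /* slow-start threshold */
    unsigned long mss;       /* max segment size */
    unsigned long dupacks;   /* current run of duplicate acks */
};

/* Input side: on reaching the dup-ack threshold, set cwnd to
 * ssthresh instead of to one mss (which would force a slow-start). */
static void on_dupack_threshold(struct mini_tcpcb *tp)
{
    tp->cwnd = tp->ssthresh;
}

/* Output side: effective window, inflated one mss per dup ack. */
static unsigned long effective_window(const struct mini_tcpcb *tp)
{
    unsigned long w = tp->cwnd + tp->dupacks * tp->mss;
    return w < tp->snd_wnd ? w : tp->snd_wnd;
}
```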
The actual implementation is slightly different than the above
because I wanted to avoid the multiply in the output routine
(multiplies are expensive on some RISC machines). A diff of the
old and new fastrexmit code is attached (your line numbers will
vary).
Note that we still do congestion avoidance (i.e., the window is
reduced by 50% when we detect the packet loss). But, as long as
the receiver's offered window is large enough (it needs to be at
most twice the bandwidth-delay product), we continue sending
packets (at exactly half the rate we were sending before the
loss) even after the loss is detected so the pipe stays full at
exactly the level we want and a slow-start isn't necessary.
Some algebra might make this last clear: Say U is the sequence
number of the first un-acked packet and we are using a window
size of W when packet U is dropped. Packets [U..U+W) are in
transit. When the loss is detected, we send packet U and pull
the window back to W/2. But in the round-trip time it takes
the U retransmit to fill the receiver's hole and an ack to get
back, W-1 dup acks will arrive (one for each packet in transit).
The window is effectively inflated by one packet for each of
these acks so packets [U..U+W/2+W-1) are sent. But we don't
re-send packets unless we know they've been lost so the amount
actually sent between the loss detection and the recovery ack is
(U+W/2+W-1) - (U+W) = W/2-1 which is exactly the amount congestion
avoidance allows us to send (if we add in the rexmit of U). The
recovery ack is for packet U+W so when the effective window is
pulled back from W/2+W-1 to W/2 (which happens because the
recovery ack is `new' and sets dupack to zero), we are allowed
to send up to packet U+W+W/2 which is exactly the first packet
we haven't yet sent. (I.e., there is no sudden burst of packets
as the `hole' is filled.) Also, when sending packets between
the loss detection and the recovery ack, we do nothing for the
first W/2 dup acks (because they only allow us to send packets
we've already sent) and the bottleneck gateway is given W/2
packet times to clean out its backlog. Thus when we start
sending our W/2-1 new packets, the bottleneck queue is as empty
as it can be.
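The counting argument above can be sanity-checked with a toy
simulation (packet numbers are offsets relative to U; this models
the algebra, it is not kernel code):

```c
#include <assert.h>

/* Count previously-unsent packets transmitted between loss
 * detection and the recovery ack.  The window is W when packet U
 * (offset 0) is dropped; packets [0..W) are in flight, so W-1 dup
 * acks come back, each inflating the pulled-back W/2 window by one
 * packet. */
static int new_pkts_during_recovery(int W)
{
    int highest_sent = W - 1;  /* highest offset already in flight */
    int sent = 0;
    for (int d = 1; d <= W - 1; d++) {
        int ewin = W / 2 + d;              /* inflated effective window */
        while (highest_sent < ewin - 1) {  /* window covers [0..ewin) */
            ++highest_sent;
            ++sent;                        /* a genuinely new packet */
        }
    }
    return sent;  /* the text predicts W/2 - 1 */
}
```

Nothing new goes out for the first W/2 dup acks, then one new
packet per ack, for W/2 - 1 in all, matching the algebra.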
[I don't know if you can get the flavor of what happens from
this description -- it's hard to see without a picture. But I
was delighted by how beautifully it worked -- it was like
watching the innards of an engine when all the separate motions
of crank, pistons and valves suddenly fit together and
everything appears in exactly the right place at just the right
time.]
Also note that this algorithm interoperates with old tcp's: Most
pre-tahoe tcp's don't generate the dup acks on out-of-order packets.
If we don't get the dup acks, fast retransmit never fires and the
window is never inflated so everything happens in the old way (via
timeouts). Everything works just as it did without the new algorithm
(and just as slow).
If you want to simulate this, the intended environment is:
- large bandwidth-delay product (say 20 or more packets)
- receiver advertising window of two b-d p (or, equivalently,
advertised window of the unloaded b-d p but two or more
connections simultaneously sharing the path).
- average loss rate (from congestion or other source) less than
one lost packet per round-trip-time per active connection.
(The algorithm works at higher loss rate but the TCP selective
ack option has to be implemented otherwise the pipe will go empty
waiting to fill the second hole and throughput will once again
degrade at the product of the loss rate and b-d p. With selective
ack, throughput is insensitive to b-d p at any loss rate.)
And, of course, we should always remember that good engineering
practice suggests a b-d p worth of buffer at each bottleneck --
less buffer and your simulation will exhibit the interesting
pathologies of a poorly engineered network but will probably
tell you little about the workings of the algorithm (unless the
algorithm misbehaves badly under these conditions but my
simulations and measurements say that it doesn't). In these
days of $100/megabyte memory, I dearly hope that this particular
example of bad engineering is of historical interest only.
- Van
-----------------
*** /tmp/,RCSt1a26717 Mon Apr 30 01:35:17 1990
--- tcp_input.c Mon Apr 30 01:33:30 1990
***************
*** 834,850 ****
* Kludge snd_nxt & the congestion
* window so we send only this one
! * packet. If this packet fills the
! * only hole in the receiver's seq.
! * space, the next real ack will fully
! * open our window. This means we
! * have to do the usual slow-start to
! * not overwhelm an intermediate gateway
! * with a burst of packets. Leave
! * here with the congestion window set
! * to allow 2 packets on the next real
! * ack and the exp-to-linear thresh
! * set for half the current window
! * size (since we know we're losing at
! * the current window size).
*/
if (tp->t_timer[TCPT_REXMT] == 0 ||
--- 834,850 ----
* Kludge snd_nxt & the congestion
* window so we send only this one
! * packet.
! *
! * We know we're losing at the current
! * window size so do congestion avoidance
! * (set ssthresh to half the current window
! * and pull our congestion window back to
! * the new ssthresh).
! *
! * Dup acks mean that packets have left the
! * network (they're now cached at the receiver)
! * so bump cwnd by the amount in the receiver
! * to keep a constant cwnd packets in the
! * network.
*/
if (tp->t_timer[TCPT_REXMT] == 0 ||
***************
*** 853,864 ****
else if (++tp->t_dupacks == tcprexmtthresh) {
tcp_seq onxt = tp->snd_nxt;
! u_int win =
! MIN(tp->snd_wnd, tp->snd_cwnd) / 2 /
! tp->t_maxseg;
if (win < 2)
win = 2;
tp->snd_ssthresh = win * tp->t_maxseg;
-
tp->t_timer[TCPT_REXMT] = 0;
tp->t_rtt = 0;
--- 853,864 ----
else if (++tp->t_dupacks == tcprexmtthresh) {
tcp_seq onxt = tp->snd_nxt;
! u_int win = MIN(tp->snd_wnd,
! tp->snd_cwnd);
+ win /= tp->t_maxseg;
+ win >>= 1;
if (win < 2)
win = 2;
tp->snd_ssthresh = win * tp->t_maxseg;
tp->t_timer[TCPT_REXMT] = 0;
tp->t_rtt = 0;
***************
*** 866,873 ****
tp->snd_cwnd = tp->t_maxseg;
(void) tcp_output(tp);
!
if (SEQ_GT(onxt, tp->snd_nxt))
tp->snd_nxt = onxt;
goto drop;
}
} else
--- 866,879 ----
tp->snd_cwnd = tp->t_maxseg;
(void) tcp_output(tp);
! tp->snd_cwnd = tp->snd_ssthresh +
! tp->t_maxseg *
! tp->t_dupacks;
if (SEQ_GT(onxt, tp->snd_nxt))
tp->snd_nxt = onxt;
goto drop;
+ } else if (tp->t_dupacks > tcprexmtthresh) {
+ tp->snd_cwnd += tp->t_maxseg;
+ (void) tcp_output(tp);
+ goto drop;
}
} else
***************
*** 874,877 ****
--- 880,890 ----
tp->t_dupacks = 0;
break;
+ }
+ if (tp->t_dupacks) {
+ /*
+ * the congestion window was inflated to account for
+ * the other side's cached packets - retract it.
+ */
+ tp->snd_cwnd = tp->snd_ssthresh;
}
tp->t_dupacks = 0;
*** /tmp/,RCSt1a26725 Mon Apr 30 01:35:23 1990
--- tcp_timer.c Mon Apr 30 00:36:29 1990
***************
*** 223,226 ****
--- 223,227 ----
tp->snd_cwnd = tp->t_maxseg;
tp->snd_ssthresh = win * tp->t_maxseg;
+ tp->t_dupacks = 0;
}
(void) tcp_output(tp);
::::::::::::::
Received: from venera.isi.edu by NNSC.NSF.NET id aa08372; 30 Apr 90 13:36 EDT
Posted-Date: Mon, 30 Apr 90 10:36:12 PDT
Received: from vs.ee.lbl.gov by venera.isi.edu (5.61/5.61+local)
id <AA19258>; Mon, 30 Apr 90 10:35:01 -0700
Received: by helios.ee.lbl.gov (5.61/1.39)
id AA27020; Mon, 30 Apr 90 10:36:14 -0700
Message-Id: <900430173...@helios.ee.lbl.gov>
To: end2end-...@venera.isi.edu
Subject: modified TCP congestion avoidance algorithm (correction)
Date: Mon, 30 Apr 90 10:36:12 PDT
From: Van Jacobson <v...@helios.ee.lbl.gov>
I shouldn't make last minute 'fixes'. The code I sent out last
night had a small error:
*** t.c Mon Apr 30 10:28:52 1990
--- tcp_input.c Mon Apr 30 10:30:41 1990
***************
*** 885,893 ****
* the congestion window was inflated to account for
* the other side's cached packets - retract it.
*/
! tp->snd_cwnd = tp->snd_ssthresh;
}
- tp->t_dupacks = 0;
if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
tcpstat.tcps_rcvacktoomuch++;
goto dropafterack;
--- 885,894 ----
* the congestion window was inflated to account for
* the other side's cached packets - retract it.
*/
! if (tp->snd_cwnd > tp->snd_ssthresh)
! tp->snd_cwnd = tp->snd_ssthresh;
! tp->t_dupacks = 0;
}
if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
tcpstat.tcps_rcvacktoomuch++;
goto dropafterack;
::::::::::::::
Received: from rx7.ee.lbl.gov by uu2.psi.com (5.65b/4.0.071791-PSI/PSINet) via SMTP;
id AA12583 for popbbn; Wed, 8 Sep 93 01:29:46 -0400
Received: by rx7.ee.lbl.gov for cr...@aland.bbn.com (5.65/1.44r)
id AA05271; Tue, 7 Sep 93 22:30:15 -0700
Message-Id: <930908053...@rx7.ee.lbl.gov>
To: Craig Partridge <cr...@aland.bbn.com>
Cc: David Clark <d...@lcs.mit.edu>
Subject: Re: query about TCP header on tcp-ip
In-Reply-To: Your message of Tue, 07 Sep 93 09:48:00 PDT.
Date: Tue, 07 Sep 93 22:30:14 PDT
From: Van Jacobson <v...@ee.lbl.gov>
Craig,
As you probably remember from the "High Speed TCP" CNRI meeting,
my kernel looks nothing at all like any version of BSD. Mbufs
no longer exist, for example, and `netipl' and all the protocol
processing that used to be done at netipl interrupt level are
gone. TCP receive packet processing in the new kernel really is
about 30 instructions on a RISC (33 on a sparc but three of
those are compiler braindamage). Attached is the C code & the
associated sparc assembler.
A brief recap of the architecture: Packets go in 'pbufs' which
are, in general, the property of a particular device. There is
exactly one, contiguous, packet per pbuf (none of that mbuf
chain stupidity). On the packet input interrupt, the device
driver upcalls through the protocol stack (no more of that queue
packet on the netipl software interrupt bs). The upcalls
percolate up the stack until the packet is completely serviced
(e.g., NFS requests that can be satisfied from in-memory data &
data structures) or they reach a routine that knows the rest of
the processing must be done in some process's context in which
case the packet is laid on some appropriate queue and the
process is unblocked. In the case of TCP, the upcalls almost
always go two levels: IP finds the datagram is for this host &
it upcalls a TCP demuxer which hashes the ports + SYN to find a
PCB, lays the packet on the tail of the PCB's queue and wakes up
any process sleeping on the PCB. The IP processing is about 25
instructions & the demuxer is about 10.
As Dave noted, the two processing paths that need the most
tuning are the data packet send & receive (since at most every
other packet is acked, there will be at least twice as many data
packets as ack packets). In the new system, the receiving
process calls 'tcp_usrrecv' (the protocol specific part of the
'recv' syscall) or is already blocked there waiting for new
data. So the following code is embedded in a loop at the start of
tcp_usrrecv that spins taking packets off the pcb queue until
there's no room for the next packet in the user buffer or the
queue is empty. The TCP protocol processing is done as we
remove packets from the queue & copy their data to user space
(and since we're in process context, it's possible to do a
checksum-and-copy).
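The checksum-and-copy idea can be sketched portably (the real
routine copies to user space through the uio machinery; this toy
version just folds a one's-complement sum while copying between
buffers, so the data is touched only once):

```c
#include <stddef.h>
#include <stdint.h>

/* Fold data (network byte order) into a 16-bit one's-complement
 * sum while copying it from src to dst. */
static uint32_t cksum_copy(const uint8_t *src, uint8_t *dst,
                           size_t len, uint32_t sum)
{
    size_t i;
    for (i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
    }
    if (i < len) {               /* odd trailing byte */
        sum += (uint32_t)src[i] << 8;
        dst[i] = src[i];
    }
    while (sum >> 16)            /* fold end-around carries */
        sum = (sum & 0xffff) + (sum >> 16);
    return sum;
}
```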
Throughout this code, 'tp' points to the pcb and 'ti' points to
the tcp header of the first packet on the queue (the ip header was
stripped as part of interrupt level ip processing). The header info
(excluding the ports, which are implicit in the pcb) is sucked out
of the packet into registers [this is to minimize cache thrashing and
possibly to take advantage of 64 bit or longer loads]. Then the
header checksum is computed (tp->ph_sum is the precomputed pseudo-header
checksum + src & dst ports).
int tcp_usrrecv(struct uio* uio, struct socket* so)
{
struct tcpcb *tp = (struct tcpcb *)so->so_pcb;
register struct pbuf* pb;
while ((pb = tp->tp_inq) != 0) {
register int len = pb->len;
struct tcphdr *ti = (struct tcphdr *)pb->dat;
u_long seq = ((u_long*)ti)[1];
u_long ack = ((u_long*)ti)[2];
u_long flg = ((u_long*)ti)[3];
u_long sum = ((u_long*)ti)[4];
u_long cksum = tp->ph_sum;
/* NB - ocadd is an inline gcc assembler function */
cksum = ocadd(ocadd(ocadd(ocadd(cksum, seq), ack), flg), sum);
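For reference, ocadd is just a 32-bit one's-complement add with
end-around carry; a portable C sketch of what the inline assembler
computes:

```c
#include <stdint.h>

/* one's-complement 32-bit add: wrap any carry back into the sum */
static uint32_t ocadd(uint32_t sum, uint32_t x)
{
    sum += x;
    return sum + (sum < x);  /* sum < x iff the add wrapped */
}
```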
Next is the header prediction check which is probably the most
opaque part of the code. tp->pred_flags contains snd_wnd (the
window we expect in incoming packets) in the bottom 16 bits and
0x4x10 in the top 16 bits. The 'x' is normally 0 but will be
set non-zero if header prediction shouldn't be done (e.g., if
not in established state, if retransmitting, if hole in seq
space, etc.). So, the first term of the 'if' checks four
different things simultaneously:
- that the window is what we expect
- that there are no tcp options
- that the packet has ACK set & doesn't have SYN, FIN, RST or URG set
- that the connection is in the right state
and the 2nd term of the if checks that the packet is in sequence:
#define FMASK (((0xf000 | TH_SYN|TH_FIN|TH_RST|TH_URG|TH_ACK) << 16) | 0xffff)
if ((flg & FMASK) == tp->pred_flags && seq == tp->rcv_nxt) {
The next few lines are pretty obvious -- we subtract the header
length from the total length and if it's less than zero the packet
was malformed, if it's zero we must have a pure ack packet & we
do the ack stuff otherwise if the ack field didn't move we have
a pure data packet which we copy to the user's buffer, checksumming
as we go, then update the pcb state if everything checks:
len -= 20;
if (len <= 0) {
if (len < 0) {
/* packet malformed */
} else {
/* do pure ack things */
}
} else if (ack == tp->snd_una) {
cksum = in_uiomove((u_char*)ti + 20, len, uio, cksum);
if (cksum != 0) {
/* packet or user buffer errors */
}
seq += len;
tp->rcv_nxt = seq;
if ((int)(seq - tp->rcv_acked) >= 0) {
/* send ack */
} else {
/* free pbuf */
}
continue;
}
}
/* header prediction failed -- take long path */
...
That's it. On the normal receive data path we execute 16 lines of
C which turn into 33 instructions on a sparc (it would be 30 if I
could trick gcc into generating double word loads for the header
& my carefully aligned pcb fields). I think you could get it down
around 25 on a cray or big-endian alpha since the loads, checksum calc
and most of the compares can be done on 64 bit quantities (e.g.,
you can combine the seq & ack tests into one).
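For instance, the seq & ack tests combine into one 64-bit compare
roughly like this (hypothetical names; on a big-endian machine the
two adjacent header fields could be loaded as a single 64-bit word
instead of being assembled as below):

```c
#include <stdint.h>

/* test seq against rcv_nxt and ack against snd_una in one compare */
static int seq_ack_match(uint32_t seq, uint32_t ack,
                         uint32_t rcv_nxt, uint32_t snd_una)
{
    uint64_t got  = ((uint64_t)seq << 32) | ack;
    uint64_t want = ((uint64_t)rcv_nxt << 32) | snd_una;
    return got == want;
}
```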
Attached is the sparc assembler for the above receive path. Hope
this explains Dave's '30 instruction' assertion. Feel free to
forward this to tcp-ip or anyone that might be interested.
- Van
----------------
ld [%i0+4],%l3 ! load packet tcp header fields
ld [%i0+8],%l4
ld [%i0+12],%l2
ld [%i0+16],%l0
ld [%i1+72],%o0 ! compute header checksum
addcc %l3,%o0,%o3
addxcc %l4,%o3,%o3
addxcc %l2,%o3,%o3
addxcc %l0,%o3,%o3
sethi %hi(268369920),%o1 ! check if hdr. pred possible
andn %l2,%o1,%o1
ld [%i1+60],%o2
cmp %o1,%o2
bne L1
ld [%i1+68],%o0
cmp %l3,%o0
bne L1
addcc %i2,-20,%i2
bne,a L3
ld [%i1+36],%o0
! packet error or ack processing
...
L3:
cmp %l4,%o0
bne L1
add %i0,20,%o0
mov %i2,%o1
call _in_uiomove,0
mov %i3,%o2
cmp %o0,0
be L6
add %l3,%i2,%l3
! checksum error or user buffer error
...
L6:
ld [%i1+96],%o0
subcc %l3,%o0,%g0
bneg L7
st %l3,[%i1+68]
! send ack
...
br L8
L7:
! free pbuf
...
L8: ! done with this packet - continue
...
L1: ! hdr pred. failed - do it the hard way