
Network performance tuning.


Leo Bicknell

Jul 11, 2001, 7:50:39 PM

I'm going to bring up a topic that is sure to spark a great debate
(read: flamefest), but I think it's an important issue. I've put
my nomex on, let's see where this goes.

I work for an international ISP. One of the customer complaints
that has been on the rise is poor transfer rates across our network.
When these come up, I'll often get called in to investigate. Over
the past 2-3 years there has been an alarming increase in these
complaints, and what disturbs me more is there is a simple solution
99% of the time - increase the TCP window size.

Admittedly, my environment is a bit rare. This generally comes
from colo customers who have two beefy, 100Mbps-connected servers on
opposite coasts and can't understand why around 100k/sec is the
best transfer rate they can get. If only we all had uncongested
100Mbps connections! Anyway, after having them up the window size
on their machines, we can, if necessary, get them up to full 100Mbps
across the country (I have logs of 9.98MB/sec FTP's coast to coast,
if anyone wants them).

So, I decided it was time to pick on FreeBSD. There are a number
of reasons, chief among them is that virtually all other OS's now
have larger default window sizes (and thus offer better performance)
than FreeBSD out of the box. A secondary reason is that, for the
first time, real end users, in the form of cable modem subscribers,
are being hit by this same issue.

Let's cut to the nitty-gritty. This is all limited by the
bandwidth*delay product: you can ship one window per RTT, and all that.
If you don't understand this already, go read about TCP, then come
back to this message. :-) FreeBSD's current default is 16384 bytes
for the window, giving us the following limits on performance:

LAN                   1ms rtt = 15 MB/sec
Coast to Coast       65ms rtt = 246 KB/sec
Coast to Coast       85ms rtt = 188 KB/sec
East Coast to Japan 155ms rtt = 103 KB/sec
London to Japan     225ms rtt = 71 KB/sec
T1 Satellite Link   500ms rtt = 32 KB/sec
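
These limits are just window / RTT. A throwaway C program (purely
illustrative, not from the thread) that reproduces the table's
arithmetic, all in KB/sec:

#include <stdio.h>

int
main(void)
{
	const double window = 16384.0;	/* FreeBSD's default window, bytes */
	const struct { const char *path; double rtt; } paths[] = {
		{ "LAN",			0.001 },
		{ "Coast to Coast",		0.065 },
		{ "Coast to Coast",		0.085 },
		{ "East Coast to Japan",	0.155 },
		{ "London to Japan",		0.225 },
		{ "T1 Satellite Link",		0.500 },
	};
	int i;

	/* Max throughput = one window per round trip. */
	for (i = 0; i < 6; i++)
		printf("%-20s %3.0fms rtt = %6.0f KB/sec\n", paths[i].path,
		    paths[i].rtt * 1000.0, window / paths[i].rtt / 1024.0);
	return (0);
}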

So, inside the US, the current window, 16k, lets a single connection
just fill a T1, more or less. Note that these numbers assume optimal
conditions; you may see a degradation of up to 50% from them when
bandwidth is available but there is high jitter or packets are
reordered.

I wonder how many people are discontinuing DirectPC service because
they can't get over 32 KB/sec downloads from their "T1 speed"
satellite service.

One of the first responses I often get to this issue is "so what,
system administrators can increase the values". This is true;
however, I think it's time to address the defaults. There are a
number of reasons for this:

* BOTH ends of a TCP connection must be increased. All the server
admins in the world can do this, but if end users don't it is
useless. Conversely, end users who do this now won't see a speed
up unless all the server admins change the settings.

* FreeBSD is at the middle-bottom of the pack when it comes to
defaults. http://www.psc.edu/networking/perf_tune.html

* Users are slowly getting faster connections (T1 DSL, T1 Satellite,
10 Mbps cable modems) that need larger values.

* The only way for users to get around this limit today is to write
custom apps that raise the values using the socket calls, as in the
sketch below. Hard-coding window sizes into apps is a poor solution.
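
For illustration, that workaround looks something like the following
minimal sketch using the standard socket calls (both ends need the
equivalent, and the kernel caps the request at kern.ipc.maxsockbuf):

#include <sys/types.h>
#include <sys/socket.h>
#include <stdio.h>

int
main(void)
{
	int s = socket(AF_INET, SOCK_STREAM, 0);
	int sz = 1024 * 1024;	/* ~1 Meg, sized for 100Mbps coast-to-coast */

	/*
	 * Set the buffers before connect()/listen(): the RFC 1323 window
	 * scale is negotiated at SYN time from the receive buffer size.
	 */
	if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) < 0)
		perror("SO_SNDBUF");
	if (setsockopt(s, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)) < 0)
		perror("SO_RCVBUF");
	return (0);
}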

Unfortunately this is where things get really interesting. If you
want to, say, support a 100Mbps transfer over a single TCP connection,
you need a buffer around 1 Meg. That's a lot of buffer. That
said, most large servers, and even end user workstations, could
devote 1 Meg to the network if it meant 100Mbps performance. Sadly,
this has unintended consequences.
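
To put a number on that: at the 85ms coast-to-coast RTT from the table
above, a full 100Mbps stream keeps

	100,000,000 bits/sec / 8 * 0.085 sec = 1,062,500 bytes

of unacknowledged data in flight, hence the roughly 1 Meg buffer.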

If you dig down into the TCP stack, you find a problem. When a socket
is created in FreeBSD (and, I presume, in many other BSDs as well) its
buffer limits are set (soreserve). The behavior today is to set
them to the system default values at socket creation time. So,
what happens is a dial-up user connects to a web server to download
an MP3 file. The socket sets aside a 1 Meg buffer, the web server
dumps 1 Meg into it, and then the kernel has to keep that 1 Meg
around in MBUF's until it can dribble out to the end user. No
surprise, you run out of MBUF's in a hurry.

There are a number of issues that come out of this:

* MBUF's are currently allocated based on NMBCLUSTERS, which is
based on MAXUSERS (unless overridden). NMBCLUSTERS is found
using the formula 512 + MAXUSERS * 16 (e.g. MAXUSERS=32 yields 1024
clusters). This formula has been in use for a long time, and it may
be time to consider allocating a few more clusters per user. The
number of MBUF's is 4 * NMBCLUSTERS, which is a fine number on paper,
but testing shows it gives you too many MBUF's in many cases.
(Or, put another way, most every system I've seen shows a trend
of running out of clusters way before MBUF's.)

* The socket layer needs to be more intelligent about its buffering.
Simply always allocating the largest buffer is easy to code, but
wastes considerable resources, particularly on machines with lots
of connections.

So, I'd like to propose some fixes to get people thinking. I have
ordered them in the order I think they should be done:

1) The per-socket defaults should be raised to 32k in the next
release, giving 2x today's performance in general and putting
FreeBSD at least on par with most Linux distros. I think the
memory consequences here are quite minor, and it provides a good
place to study the effects on real-world users.

2) The socket layer needs to be modified to not use the maximum
buffer as the default. Imagine if disk drivers allocated 4 Meg
for every process writing to disk, just because the disk has a
4 Meg cache. The buffer clearly needs to hold all unacknowledged
data, and should therefore grow as the window size grows, plus
some overhead so that some unsent data can be buffered in the
kernel (to avoid context switches and the like). This way
connections to slow hosts (e.g. dial-up users) would not buffer
much more than the window size, using only a small amount of
memory. This would allow admins to set the sizes much larger
without wasting memory on connections that will never use it.

Note: from looking at soreserve and related code, it appears it
just sets maximums, and raising it midstream would have no
ill effects. (Reducing it would.) So a good first stab might be
to pass a new "initial socket buffer" size to soreserve
when a new socket is created; then, if the TCP window could be
increased past that value at any point, soreserve could be recalled
(or a resize function created) to raise the limit to 2 * maxwin,
or 1.1 * maxwin, or maxwin + buffer, or whatever is appropriate,
up to the hard limit set by the system administrator.

3) The number of MBUF's needs to be increased. Ideally this should
be dynamically changeable, which it is not today. As the net
gets faster, users need more network resources per user, hence
more MBUF's. Also, I wonder if it should be determined from
MAXUSERS at all. It is in fact related to the maximum number
of simultaneous network connections, and it might make more
sense to base it off that, with a default based on MAXUSERS (but
larger).

Point #2 is very critical. Right now it means someone who runs a
web server must leave the values fairly low (probably OK for serving
dial-up and DSL users) to avoid running out of MBUF's, but then can't,
without much hackery, get high-speed transfers on the nightly backup
run or a content distribution run across the network. Buffers need to
be more dynamically scaled to individual connections.
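
As a sketch of what proposal #2 might look like against 4.x-era
structures (the function and the 1.5x growth policy are illustrative,
not existing kernel code; only the observation that the limit can
safely be raised midstream comes from the proposal above):

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

static void
tcp_sndbuf_grow(struct socket *so, u_long maxwin)
{
	u_long want = maxwin + maxwin / 2;	/* window plus unsent headroom */

	if (want > sb_max)		/* administrator's hard limit */
		want = sb_max;
	if (so->so_snd.sb_hiwat < want)		/* raising is safe; shrinking is not */
		(void)sbreserve(&so->so_snd, want);
}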

So, bottom line, in the end I would like a FreeBSD host that out
of the box can get 2-4 MBytes/sec across country (or better), but
that manages it in such a way that your standard web server running
on a FreeBSD box doesn't fall over. Is it just a pipe dream, or
can we make that happen with a little effort?

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org


Mike Silbersack

Jul 11, 2001, 8:36:35 PM

On Wed, 11 Jul 2001, Leo Bicknell wrote:

>
> I'm going to bring up a topic that is sure to spark a great debate
> (read: flamefest), but I think it's an important issue. I've put
> my nomex on, let's see where this goes.

I don't think this will start a flamefest; most of what you suggest is
definitely needed. However, the main question is one of developer time.

Bosko just rewrote the mbuf subsystem in -current, making memory
reclamation more feasible. However, I doubt much of this will be material
that can be ported back to 4.x.

You seem to have hit the crux of the problem - we need dynamically
tuned socket buffers. I think that there are patches which implement
that feature for NetBSD; perhaps they can be ported over. If you (or
anyone else with free time) would port that code, I don't think it
would see barriers to inclusion. <hint hint>

As for changing the default buffer sizes and the mbuf-to-cluster ratio...
that could certainly spark a long debate. In general, I agree with your
suggestions. However, let's cut the debate short. Since you have a bunch
of fbsd servers you can check out, track the output of netstat -m over the
course of a few days on them. From this data, the answer to the ratio
question should become clear.

Mike "Silby" Silbersack

Leo Bicknell

Jul 11, 2001, 8:33:35 PM

I love responding to my own messages, but I do have something to add.
The following link, which seems to be along the right lines, was given
to me by an interested party. I have BCC'ed them on this message so
they can show themselves if they want, or stay in the shadows for the
time being.

Take a look at http://www.psc.edu/networking/tcp.html, in particular
http://www.psc.edu/networking/auto.html. I can't believe I overlooked
it when I was looking at the site earlier. It's a potential fix to
point #2 in my message. It has working, if experimental, NetBSD code.
Perhaps a FreeBSD version should appear soon, defaulting to off,
so it can get out to the world.

Julian Elischer

Jul 11, 2001, 8:48:23 PM
Some good points.


On Wed, 11 Jul 2001, Leo Bicknell wrote:

>
>

>
> * FreeBSD is at the middle-bottom of the pack when it comes to
> defaults. http://www.psc.edu/networking/perf_tune.html

AND we still don't have a working standard SACK implementation.

>
> There are a number of issues that come out of this:
>
> * MBUF's are currently allocated based on NMBCLUSTERS, which is
> based on MAXUSERS (unless overridden). NMBCLUSTERS is found
> using the formula 512 + MAXUSERS * 16 (e.g. MAXUSERS=32 yields 1024
> clusters). This formula has been in use for a long time, and it may
> be time to consider allocating a few more clusters per user. The
> number of MBUF's is 4 * NMBCLUSTERS, which is a fine number on paper,
> but testing shows it gives you too many MBUF's in many cases.
> (Or, put another way, most every system I've seen shows a trend
> of running out of clusters way before MBUF's.)
>

This CAN be set separately, I think..


> * The socket layer needs to be more intelligent about its buffering.
> Simply always allocating the largest buffer is easy to code, but
> wastes considerable resources, particularly on machines with lots
> of connections.

certainly it should be dynamic

I'm certain we can work out the bandwidth-delay product ..
and thus the maximum window needed...

>
> So, I'd like to propose some fixes to get people thinking. I have
> ordered them in the order I think they should be done:
>
> 1) The per-socket defaults should be raised to 32k in the next
> release, giving 2x today's performance in general and putting
> FreeBSD at least on par with most Linux distros. I think the
> memory consequences here are quite minor, and it provides a good
> place to study the effects on real-world users.

I think that being able to dynamically work out the window
would be the best idea..
The minimum RTT is discovered pretty quickly.

Keeping the buffer size to no more than twice the currently calculated
bandwidth-delay product (or the maximum window size)
would probably be a good move. Your suggestion below is certainly a good
basis to start from.


>
> 2) The socket layer needs to be modified to not use the maximum
> buffer as the default. Imagine if disk drivers allocated 4 Meg
> for every process writing to disk, just because the disk has a
> 4 Meg cache. The buffer clearly needs to hold all unacknowledged
> data, and should therefore grow as the window size grows, plus
> some overhead so that some unsent data can be buffered in the
> kernel (to avoid context switches and the like). This way
> connections to slow hosts (e.g. dial-up users) would not buffer
> much more than the window size, using only a small amount of
> memory. This would allow admins to set the sizes much larger
> without wasting memory on connections that will never use it.
>
> Note: from looking at soreserve and related code, it appears it
> just sets maximums, and raising it midstream would have no
> ill effects. (Reducing it would.) So a good first stab might be
> to pass a new "initial socket buffer" size to soreserve
> when a new socket is created; then, if the TCP window could be
> increased past that value at any point, soreserve could be recalled
> (or a resize function created) to raise the limit to 2 * maxwin,
> or 1.1 * maxwin, or maxwin + buffer, or whatever is appropriate,
> up to the hard limit set by the system administrator.
>

Bsd...@aol.com

Jul 12, 2001, 1:50:36 PM
In a message dated 07/11/2001 7:51:11 PM Eastern Daylight Time,
bick...@ufp.org writes:

> So, bottom line, in the end I would like a FreeBSD host that out
> of the box can get 2-4 MBytes/sec across country (or better), but
> that manages it in such a way that your standard web server running
> on a FreeBSD box doesn't fall over. Is it just a pipe dream, or
> can we make that happen with a little effort?
>

Of course, if everyone on the internet does this, you are back to square one.

The window is there for flow control and data integrity. You seek to
undermine those concepts, which doesn't seem like a good idea for an "out of
the box" operating system.

B

Leo Bicknell

Jul 12, 2001, 1:56:58 PM
On Thu, Jul 12, 2001 at 01:50:05PM -0400, Bsd...@aol.com wrote:
> The window is there for flow control and data integrity. You seek to
> undermine those concepts, which doesnt seem like a good idea for an "out of
> the box" operating system

Not at all. Nothing I've suggested removes the window, or changes
the flow control properties in any way at all. What I've suggested
is that we remove an outside, artificial limiting force, so those
mechanisms can actually do what is intended.

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org


Leo Bicknell

Jul 12, 2001, 9:10:11 PM
On Thu, Jul 12, 2001 at 05:55:39PM +0100, Paul Robinson wrote:
> When I asked about SACK about 18 months ago (IIRC), the general consensus
> was that it was a pile of crap, and that FBSD SHOULDN'T implement it if
> possible. I, however, agree that there are a lot of things in SACK that would
> massively benefit FBSD's net performance.

Does anyone know if Luigi's patches at http://www.iet.unipi.it/~luigi/sack.html
ever got wider use than his own testing? It looks like it was written some
time ago, and if people have been running it since then there might be some
real world data.

If it helps, it could be a win for web servers, as it appears Win98 has SACK
on by default.

Matt Dillon

Jul 12, 2001, 9:28:41 PM

I think the crux of the situation here is that the correct solution is
to introduce more dynamism into the way the kernel handles buffer space
for tcp connections.

For example, we want to be able to sysctl net.inet.tcp.sendspace and
recvspace to high values (e.g. 65535, 1000000) without losing our
ability to scale to many connections. This implies that the kernel
must be able to dynamically reduce the 'effective' send and receive
space limit to accommodate available mbuf space when the number of
connections grows.
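
A minimal sketch of that internal scaling (the names and the linear
policy are invented for illustration; the real accounting would come
from the mbuf subsystem):

static unsigned long
effective_sendspace(unsigned long sendspace,	/* net.inet.tcp.sendspace */
    unsigned long clusters_inuse, unsigned long clusters_max)
{
	unsigned long space = sendspace;

	/* Below 50% cluster usage, hand out the full default. */
	if (clusters_inuse * 2 > clusters_max)
		/* Above 50%, scale the default linearly toward the floor. */
		space = space * 2 * (clusters_max - clusters_inuse) /
		    clusters_max;
	if (space < 4096)
		space = 4096;	/* keep a sane per-connection floor */
	return (space);
}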

This is fairly easy to do for the transmit side of things and would
yield an immediate improvement in available mbuf space. For the receive
side of things we can't really do anything with existing connections
(because we've already advertised that the space is available to the
remote end), but we can certainly reduce the buffer space we reserve
for new connections. If the system is handling a large number of
connections then this sort of scaling will work fairly well due to
attrition.

We can do all of this without ripping out the pre-allocation of
buffer space. I.E. forget trying to do something fancy like
swapping out buffers or virtualizing buffers or advertising more
than we actually have etc etc etc. Think of it more in terms of
the system internally sysctl'ing down the send and receive buffer
space defaults in a dynamic fashion and doing other reclamation to
speed it along.

So in regards to Leo's suggestions. I think we can bump up our current
defaults, and I would support increasing the 16384 default to 24576 or
possibly even 32768 as well as increasing the number of mbufs. But
that is only a stopgap measure. What we really need to do is what I
just described.

-Matt

Drew Eckhardt

Jul 12, 2001, 10:17:46 PM
In message <200107130128...@earth.backplane.com>, dil...@earth.backplane.com writes:
> This is fairly easy to do for the transmit side of things and would
> yield an immediate improvement in available mbuf space. For the receive
> side of things we can't really do anything with existing connections
> (because we've already advertised that the space is available to the
> remote end),

You can't change the RFC 1323 window scale.

You can reduce the window size with each ACK, although this is frowned
upon.

To quote RFC 793,

The mechanisms provided allow a TCP to advertise a large window and to
subsequently advertise a much smaller window without having accepted
that much data. This, so called "shrinking the window," is strongly
discouraged. The robustness principle dictates that TCPs will not
shrink the window themselves, but will be prepared for such behavior
on the part of other TCPs.

Given a choice between failing new connections because insufficient
buffer space is available and a slow down (both from the decreased
window size and packets dropped by the sender as it's adjusting), the
latter is probably preferable.

Of course, avoiding this by reducing the buffer size of new connections
before you run out is a better idea.

--
<a href="http://www.poohsticks.org/drew/">Home Page</a>
For those who do, no explanation is necessary.
For those who don't, no explanation is possible.

Alfred Perlstein

Jul 12, 2001, 10:18:46 PM
* Matt Dillon <dil...@earth.backplane.com> [010712 20:28] wrote:
>
> This is fairly easy to do for the transmit side of things and would
> yield an immediate improvement in available mbuf space. For the receive
> side of things we can't really do anything with existing connections
> (because we've already advertised that the space is available to the
> remote end), but we can certainly reduce the buffer space we reserve
> for new connections. If the system is handling a large number of
> connections then this sort of scaling will work fairly well due to
> attrition.

Actually, we can shrink the window, but that's strongly discouraged
by a lot of papers/books.

> So in regards to Leo's suggestions. I think we can bump up our current
> defaults, and I would support increasing the 16384 default to 24576 or
> possibly even 32768 as well as increasing the number of mbufs. But
> that is only a stopgap measure. What we really need to do is what I
> just described.

It doesn't sound too bad to just double the current values; are you going
to commit it?

--
-Alfred Perlstein [alf...@freebsd.org]
Ok, who wrote this damn function called '??'?
And why do my programs keep crashing in it?

Mike Silbersack

Jul 12, 2001, 10:22:21 PM

On Thu, 12 Jul 2001, Matt Dillon wrote:

> This is fairly easy to do for the transmit side of things and would
> yield an immediate improvement in available mbuf space. For the receive
> side of things we can't really do anything with existing connections
> (because we've already advertised that the space is available to the
> remote end), but we can certainly reduce the buffer space we reserve
> for new connections. If the system is handling a large number of
> connections then this sort of scaling will work fairly well due to
> attrition.
>
> We can do all of this without ripping out the pre-allocation of
> buffer space. I.E. forget trying to do something fancy like
> swapping out buffers or virtualizing buffers or advertising more
> than we actually have etc etc etc. Think of it more in terms of
> the system internally sysctl'ing down the send and receive buffer
> space defaults in a dynamic fashion and doing other reclamation to
> speed it along.

Buffer space isn't really preallocated, so no work needs to be done there,
as you state. The trick is auto-changing the size *correctly*. If we
bump the default up to 64K, apache will fill each socket with 64K of data,
no matter the speed of the box requesting the file. So, we have to start
small by default and scale up as the connection goes on. The criteria to
use when scaling up are what would require the most work in making the
buffer space dynamic.
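
A minimal sketch of that grow-on-evidence policy for the receive side
(names and thresholds invented; loosely the PSC auto-tuning idea
linked earlier in the thread):

static unsigned long
grow_recvspace(unsigned long hiwat,	/* current advertised limit */
    unsigned long acked_last_rtt,	/* data the peer moved in one rtt */
    unsigned long hard_max)		/* admin-set ceiling */
{
	/* Grow only when the sender actually fills most of the window... */
	if (acked_last_rtt * 4 >= hiwat * 3 && hiwat * 2 <= hard_max)
		return (hiwat * 2);	/* double, congestion-window style */
	/* ...and never shrink: the space was already advertised. */
	return (hiwat);
}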

Mike "Silby" Silbersack

Mike Silbersack

Jul 12, 2001, 10:28:09 PM

On Thu, 12 Jul 2001, Alfred Perlstein wrote:

> * Matt Dillon <dil...@earth.backplane.com> [010712 20:28] wrote:
>
> Actually, we can shrink the window, but that's strongly discouraged
> by a lot of papers/books.

I doubt you really need to shrink the window ever - the fact that you've
hit the mbuf limit basically enforces that limit. And, if we're only
upping the limit based on actual ACKing of data, there's no (major) DoS
issue.

However, it would be nice to have the ability to shrink the window,
specifically in the case where there *is* a DoS going on. :)

> > So in regards to Leo's suggestions. I think we can bump up our current
> > defaults, and I would support increasing the 16384 default to 24576 or
> > possibly even 32768 as well as increasing the number of mbufs. But
> > that is only a stopgap measure. What we really need to do is what I
> > just described.
>
> It doesn't sound too bad to just double the current values, are you going
> to commit it?

I'd like to do this also, provided that we also change the mbuf to cluster
ratio from 4/1 to 2/1. This will ensure that the doubled per-socket
memory usage doesn't cause systems to run out of clusters earlier than
before.

Mike "Silby" Silbersack

Leo Bicknell

Jul 12, 2001, 10:31:09 PM
On Thu, Jul 12, 2001 at 08:17:14PM -0600, Drew Eckhardt wrote:
> You can reduce the window size with each ACK, although this is frowned
> upon.

There's "frowned upon" and "frowned upon". :-) For instance, if
the only reason it's discouraged is because it causes connections
to start running slower, then I would consider that something worth
rethinking. If there are cases where it actively causes issues
for the far end stack, then it probably should be avoided.

That RFC is old enough that it's worth double-checking that the
recommendations still make sense today.

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org


Leo Bicknell

Jul 12, 2001, 10:39:06 PM
On Thu, Jul 12, 2001 at 09:27:54PM -0500, Mike Silbersack wrote:
> I'd like to do this also, provided that we also change the mbuf to cluster
> ratio from 4/1 to 2/1. This will ensure that the doubled per-socket
> memory usage doesn't cause systems to run out of clusters earlier than
> before.

This is sort of backwards. Today we have (kern/uipc_mbuf.c):

#ifndef NMBCLUSTERS
#define NMBCLUSTERS (512 + MAXUSERS * 16)
#endif
TUNABLE_INT_DECL("kern.ipc.nmbclusters", NMBCLUSTERS, nmbclusters);
TUNABLE_INT_DECL("kern.ipc.nmbufs", NMBCLUSTERS * 4, nmbufs);

What you actually want to do is double the number of clusters:

#define NMBCLUSTERS (512 + MAXUSERS * 32)

And then do half as many mbuf's per cluster:

TUNABLE_INT_DECL("kern.ipc.nmbufs", NMBCLUSTERS * 2, nmbufs);

I think. Here's a sample from a system I run (netstat -m):

151/5024/18432 mbufs in use (current/peak/max):
128/4608/4608 mbuf clusters in use (current/peak/max)

As you can see, clusters peaked, while mbuf's were only 1/3 used.
I want to see some data points from other types of servers before
saying this really is a good idea. That said, so far every system
I've checked runs out of clusters before mbuf's.

Can some other people check systems in various forms of use?

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org


E.B. Dreger

Jul 13, 2001, 12:10:22 AM
> Date: Thu, 12 Jul 2001 21:09:44 -0400
> From: Leo Bicknell <bick...@ufp.org>

> http://www.iet.unipi.it/~luigi/sack.html

Hmmmm. I don't yet know enough about kernel architecture to know
in advance how I'd fare trying to patch that into 4.x (I expect
the line numbers to be off, obviously), but I have a box at home:

* 4.x (currently 4.3-R)
* Running squid
* Win 98 clients pulling Web pages through the proxy
* Used for telnet and ssh

if testing would be of service.


Eddy

---------------------------------------------------------------------------
Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence
---------------------------------------------------------------------------

Rik van Riel

Jul 13, 2001, 9:53:38 AM
On Thu, 12 Jul 2001, Matt Dillon wrote:

> yield an immediate improvement in available mbuf space. For the receive
> side of things we can't really do anything with existing connections
> (because we've already advertised that the space is available to the
> remote end),

In emergencies it should be easy enough to just not ACK
the packets and drop them; this should cause the remote
end to slow down and the connection to use less memory.

Not the most elegant method, but probably usable DoS
protection.

cheers,

Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aard...@nl.linux.org (spam digging piggy)

Dan Nelson

Jul 13, 2001, 10:53:04 AM
In the last episode (Jul 12), Leo Bicknell said:
> On Thu, Jul 12, 2001 at 05:55:39PM +0100, Paul Robinson wrote:
> > When I asked about SACK about 18 months ago (IIRC), the general
> > consensus was that it was a pile of crap, and that FBSD SHOULDN'T
> > implement it if possible. I however, agree that there are a lot of
> > things in SACK that would massively benefit FBSD's net performance.

Considering that w2k and Linux both have sack enabled by default, it's
not going away. Do you have a link to the thread that says sack
doesn't help?

> Does anyone know if Luigi's patches at
> http://www.iet.unipi.it/~luigi/sack.html ever got wider use than his
> own testing? It looks like it was written some time ago, and if
> people have been running it since then there might be some real world
> data.

It was written for 2.1.6, and I had patches to make it work on 2.2.* and
3.0, but then the TCP stack changed enough that I couldn't keep it
working. I didn't have any problems with it in the couple of years I had
it on my system.

There was a post in June on the -net mailing list from a guy who is
working on getting SACK into -STABLE, so there's hope yet.


--
Dan Nelson
dne...@emsphone.com

Paul Robinson

Jul 13, 2001, 11:10:20 AM
On Jul 13, Dan Nelson <dne...@emsphone.com> wrote:
>
> Considering that w2k and Linux both have sack enabled by default, it's
> not going away. Do you have a link to the thread that says sack
> doesn't help?

I agree SACK is useful, but like I say, I ended up in a flamewar IIRC
because people couldn't agree whether it was any good.

Alas, the FreeBSD mail archive search engine is not very good at tracking
this thread down, but I think it was on freebsd-net that I brought it up,
and it was perhaps nearly two years ago now. I was probably using a
different mail address as well. I'll try and go back and find it later this
weekend and give you a synopsis of what went on there.



> There was a post in June on the -net mailinglist from a guy that is
> working on getting SACK into -STABLE, so there's hope yet.

Hope so. Like I say, it's one of the things I always wanted to see in
FBSD. Oh, apart from proper clustering support. ;-)

--
Paul Robinson ,---------------------------------------
Technical Director @ Akita | A computer lets you make more mistakes
PO Box 604, Manchester, M60 3PR | than any other invention with the
T: +44 (0) 161 228 6388 (F:6389)| possible exceptions of handguns and
| Tequila - Mitch Ratcliffe
`-----

Matt Dillon

Jul 13, 2001, 1:09:26 PM
Ok, I'm about half way through writing up a patch set to implement
the bandwidth delay product write-buffer calculation. It may still be a
few days before it is presentable. The TCP stack already has almost
everything required to do the calculation. Adding a fair-share piece in
addition to the bwdelay product will also be fairly easy, at least for
the write-side.

The basic problem with calculating the bandwidth delay product is that
it is an inherently unstable calculation. It has to be a continuous,
slow moving calculation or you have a good chance of destabilizing the
network (think about having a thousand connections all trying to figure
out each of their bandwidth delay products for traffic sharing random
hops in the network). The advantage of being able to do it successfully,
however, is huge. Rather than allowing the TCP connection to start to
lose packets before backing off, the bandwidth delay product calculation
instead focuses on not overloading the network in the first place while
still maintaining optimum performance.

I took a look at the paper Leo pointed out to me:

http://www.psc.edu/networking/auto.html

It's a very interesting paper, and the graphs do in fact show the type
of instability that can occur. The code is a mess, though. I think it
is possible to generate a much less noisy patch set by taking a higher
level approach to solving the problem.

E.B. Dreger

Jul 13, 2001, 1:49:36 PM
> Date: Fri, 13 Jul 2001 13:29:03 -0400
> From: Leo Bicknell <bick...@ufp.org>

(The window autotuning was an interesting read...)

> I think you're doing good work, but I'm concerned you're going
> down a road that's going to take a very long time to get right.
> It is not necessary to calculate the bandwidth*delay in order to
> prevent over-buffering. Preventing overbuffering only requires
> tracking the maximum bandwidth*delay value, assuming that we
> always want the ability to buffer that much data. I think the
> number of cases where it decreases significantly over the peak
> for a long enough time to make a difference is minimal.

This sort of reminds me of semi-empirical vs. ab initio molecular
modeling in Chemistry. MOPAC is great for quick calculations with
fairly loose tolerances; a good semi-empirical calculation is
better than a crude ab initio. However, for super-high precision,
lengthy ab initio calculations in Gaussian are the rule.

Since this is a realtime application, I'd presume that empirical
methods are better for this sort of work. Jitter would also screw up
the bw*del values.

> Fully knowing the value over time could lead to optimizations
> like shrinking the buffers, or attempting to prevent some
> packet loss by not over-increasing the window. However,
> oscillation and other issues, I think, are going to make this
> very complex.

I'd imagine that dynamic buffer sizing (high water / low water)
would make more sense. For new connections, we could pull from
a cache of memoized, empirically-determined window sizes keyed
by subnet* or IP, a la ARP- or route-caching.

* The implementation would be ugly and imprecise, but a userspace
daemon to give hints based on BGP or OSPF tables might be
interesting. Probably impractical, but I'm just brainstorming.

I'd imagine that it would make more sense to find a subnet that
includes two IPs with similar window sizes, and aggregate them
automatically. Perhaps do this on the insert.

As for preventing oscillation, one can use moving averages and
still stick with integer math if needed for x86 performance.


Eddy

---------------------------------------------------------------------------
Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence
---------------------------------------------------------------------------


Terry Lambert

Jul 13, 2001, 1:57:52 PM
Matt Dillon wrote:
> This is fairly easy to do for the transmit side of things and
> would yield an immediate improvement in available mbuf space.
> For the receive side of things we can't really do anything
> with existing connections (because we've already advertised
> that the space is available to the remote end), but we can
> certainly reduce the buffer space we reserve for new
> connections. If the system is handling a large number of
> connections then this sort of scaling will work fairly well
> due to attrition.

It's easy for the receive side, too: advertise smaller
windows with early ACK's.


> We can do all of this without ripping out the pre-allocation of
> buffer space. I.E. forget trying to do something fancy like
> swapping out buffers or virtualizing buffers or advertising more
> than we actually have etc etc etc. Think of it more in terms of
> the system internally sysctl'ing down the send and receive buffer
> space defaults in a dynamic fashion and doing other reclamation to
> speed it along.

The problem is that the tcpcb's, inpcb's, etc., are all
pre-reserved out of the KVA space map, so that they can
be allocated safely at interrupt, or because "that's how
the zone allocator works".

Nothing you do will recover that KVA space for reuse,
since zones are defined to be type-stable.

-- Terry

Leo Bicknell

Jul 13, 2001, 2:04:01 PM
On Fri, Jul 13, 2001 at 10:58:06AM -0700, Terry Lambert wrote:
> > We can do all of this without ripping out the pre-allocation of
> > buffer space. I.E. forget trying to do something fancy like
> > swapping out buffers or virtualizing buffers or advertising more
> > than we actually have etc etc etc. Think of it more in terms of
> > the system internally sysctl'ing down the send and receive buffer
> > space defaults in a dynamic fashion and doing other reclamation to
> > speed it along.
>
> The problem is that the tcpcb's, inpcb's, etc., are all
> pre-reserved out of the KVA space map, so that they can
> be allocated safely at interrupt, or because "that's how
> the zone allocator works".

I think the only critical resource here is MBUF's, which today are
preallocated at boot time. There are memory fragmentation concerns
with allocating/deallocating them on the fly.

I am not going to even attempt to get into the world of kernel
memory allocators; that's way out of my league. That said, the
interesting cases (in increasing order of difficulty):

1) Allowing an admin to change the number of MBUF's on the fly
(with sysctl). Presumably these would be infrequent events.

2) Allowing MBUF's to be allocated/deallocated in fixed-size
blocks that are easy for the allocator to deal with. (E.g., you
always have 128k to 4M of MBUF's allocated in 128k chunks.)

3) Allowing MBUF's to be fully dynamically allocated.

I'm not sure I see any value in #3. I see huge value in #1
(when you run low, you can, say, double the number on an active
server). If we get the warning I want (from another message),
#1 becomes even more useful.

#2 would take some study. The root question is does allocating
them in blocks eliminate the memory fragmentation concern for
the kernel allocator? If the answer is yes, it's probably something
to look into, if the answer is no, probably not.

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org


Terry Lambert

Jul 13, 2001, 2:19:29 PM
Mike Silbersack wrote:
> > Actually, we can shrink the window, but that's strongly discouraged
> > by a lot of papers/books.
>
> I doubt you really need to shrink the window ever - the fact that you've
> hit the mbuf limit basically enforces that limit. And, if we're only
> upping the limit based on actual ACKing of data, there's no (major) DoS
> issue.

Ask Julian about this. He has some very smart code in the
InterJet; I'm not sure it ever made it into production, or
even out of Whistle.

-- Terry

Terry Lambert

Jul 13, 2001, 2:23:35 PM
Leo Bicknell wrote:
> On Thu, Jul 12, 2001 at 08:17:14PM -0600, Drew Eckhardt wrote:
> > You can reduce the window size with each ACK, although this is frowned
> > upon.
>
> There's "frowned upon" and "frowned upon". :-) For instance, if
> the only reason it's discouraged is because it causes connections
> to start running slower, then I would consider that something worth
> rethinking. If there are cases where it actively causes issues
> for the far end stack, then it probably should be avoided.
>
> That RFC is old enough that it's worth double-checking that the
> recommendations still make sense today.

You can congest intermediate hop routers as a result of
them buffering more data than you are now willing to
accept.

Julian's approach was much better, and much cleverer...

-- Terry

Matt Dillon

Jul 13, 2001, 2:47:54 PM

:
:On Fri, Jul 13, 2001 at 10:08:57AM -0700, Matt Dillon wrote:
:> The basic problem with calculating the bandwidth delay product is that
:> it is an inherently unstable calculation. It has to be a continuous,
:
:I think you're doing good work, but I'm concerned you're going down
:a road that's going to take a very long time to get right. It is not
:necessary to calculate the bandwidth*delay in order to prevent over-
:buffering. Preventing overbuffering only requires tracking the maximum
:bandwidth*delay value, assuming that we always want the ability to
:buffer that much data. I think the number of cases where it decreases
:significantly over the peak for a long enough time to make a difference
:is minimal.
:
:Fully knowing the value over time could lead to optimizations like
:shrinking the buffers, or attempting to prevent some packet loss by
:not over-increasing the window. However, oscillation and other issues,
:I think, are going to make this very complex.
:
:--
:Leo Bicknell - bick...@ufp.org
:Systems Engineer - Internetworking Engineer - CCIE 3440

Well, you'd be surprised. 90% of the world still uses modems, so
from the point of view of a web server it would be a big win. The
bigger picture is more complex... certainly the instability of the
algorithm is a big issue, but it opens the door to research because
congestion avoidance and bandwidth and latency guarantees across an
arbitrary network ultimately come down to having to deal with something
like this.

Leo Bicknell

Jul 13, 2001, 2:51:53 PM
On Fri, Jul 13, 2001 at 11:47:19AM -0700, Matt Dillon wrote:
> Well, you'd be surprised. 90% of the world still uses modems, so
> from the point of view of a web server it would be a big win. The

Doesn't that sort of make my point though? With the current defaults of
16k/socket there is no trouble filling modems, and no one seems worried
about the amount of memory that uses (basically all the installed machines
out there are running just fine).

So, if we leave a hard minimum of 16k/socket, just chalk that up to waste,
and call it good enough, we only have to handle the 10% of the world using
more.

It would be nice to have the code to scale down 100 modem users to 8k, rather
than 16k, but that's still only 800k of memory recovery (for 100 simultaneous
connections), and we're talking about the ability to support streams that
need up to 1M per stream of buffer, so 800k seems "interesting" but not
"important".

Better would probably be to lower the default to 8k, more than enough for
modem users, and let the scale up code hit the few 16k people.

*shrug*

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440


Julian Elischer

Jul 13, 2001, 3:08:25 PM
Terry is servicing 1,000,000 connections..
so I'm sure the savings are real to him...


Julian

Leo Bicknell

Jul 13, 2001, 3:16:18 PM
On Fri, Jul 13, 2001 at 01:48:57PM -0700, Julian Elischer wrote:
> terry is servicing 1,000,000 connections..
> so I'm sure the savings are real to him...

I will be the first to suggest that there are a small number
of server configurations that require some amount of hand tuning
in order to get optimum performance.

I think the likelihood that he's doing that with a 100% stock "out
of the box" config is, well, 0, so he's quite free to continue
to hand-configure things.

That said, as I mentioned the first time, I think the code to do
the savings is good, and should be developed. I think the number
of cases helped by the savings is an order of magnitude less
than the number of cases helped by the increases though.

Terry Lambert

Jul 13, 2001, 3:29:38 PM
Leo Bicknell wrote:
> > The problem is that the tcpcb's, inpcb's, etc., are all
> > pre-reserved out of the KVA space map, so that they can
> > be allocated safely at interrupt, or because "that's how
> > the zone allocator works".
>
> I think the only critical resource here is MBUF's, which today are
> preallocated at boot time. There are memory fragmentation concerns
> with allocating/deallocating them on the fly.

The tcpcb's, inpcb's, etc. are in a similar boat; see "zalloci"
and "ziniti".


> I am not going to even attempt to get into the world of kernel
> memory allocators, that's way out of my league. That said, the
> interesting cases (in increasing order of difficulty):

I have an allocator that addresses the fragmentation issues;
it can be jammed into a Dynix allocator (Bosko/Alfred-style),
as well, pretty easily. I haven't done that because of the
need to have a three tier scheme (Dynix uses a two tier) to
allow recovery of the resource blocks over time to make them
non-type-stable, and therefore capable of being repurposed
(Dynix does this). The third tier is to grab a contiguous
chunk of KVA to back the second tier, so that allocations can
occur at interrupt time (as in the current zone allocator,
which prereserves the page table mappings).

The zone allocator also aligns to 32 byte boundaries, when
really it should only be aligning to sizeof(long) boundaries
(my allocator does this for internal object boundaries, and
does not have wasted "partial pages").

The main problem is that, in order to do interrupt level
allocations, the ziniti() expects to preallocate the page
table mappings (just as the mbuf allocation does), so that
it can be filled from free RAM. This is also the reason
that running out of free RAM causes mbuf allocations "to do
bad things": you can't overcommit pages that are going to
be assigned at fault-in-interrupt time.


> 1) Allowing an admin to change the number of MBUF's on the fly
> (with sysctl). Presumably these would be infrequent events.

This is pretty much "not a chance in hell"; even though
they are sized such that page size is an even multiple of
mbuf size, the allocator can't really handle the idea of
the zone not being contiguous, since there are other things
that end up not being sized such that page size is an even
multiple of object size (e.g. 192 bytes for a tcpcb).

Thus, you cannot get away from the KVA contiguity requirement
without separating memory into interrupt and non-interrupt
zones on one axis, high-, medium-, and low-persistence
objects on another axis, and object cluster sizes on a
third axis.

This gets even more complex when you factor in per-CPU memory
pools for SMP.


> 2) Allowing MBUF's to be allocated/deallocated in fixed size
> blocks easy for the allocator to deal with. (Eg, you always
> have 128k to 4 M of MBUF's allocated in 128k chunks.)

The problem with this is still that the page mappings must
exist, since mbufs are allocated by drivers at interrupt
out of preassigned KVA space. In a livelock situation, you
will find that you will not be able to go into non-interrupt
space to grab your next 4M KVA space chunk. Setting arbitrary
power of two size limits is also bad, unless your allocator
is very, very clever. It's impossible to be that clever with
a fixed size "superallocation" target: you have to think in
terms of page units.

> 3) Allowing MBUF's to be fully dynamically allocated.
>
> I'm not sure I see any value to #3. I see huge value to #1
> (when you run low, you can say double the number on an active
> server). If we get the warning I want (from another message)
> #1 becomes even more useful.

Can't happen without a complete rework that makes allocations
at interrupt permissible. The major problem here is that
you have a finite KVA space, and you can't reuse it without
swapping, and you can't swap to disk in the middle of a network
interrupt. It's a chicken-and-egg problem. I'm not aware of
an OS that has solved it (not to mention that your swap may be
NFS mounted).


> #2 would take some study. The root question is does allocating
> them in blocks eliminate the memory fragmentation concern for
> the kernel allocator? If the answer is yes, it's probably something
> to look into, if the answer is no, probably not.

Not as it presently exists. The fragmentation concern is over
the contiguity of the region, not over having fragments lying
around. Realize that, in the limit, it's possible to defrag
the KVA space, since as long as the data is not in the defrag
code path, we're just talking about objects that are allocated
in the KVA space, which isn't the physical space, and we only
rarely care about physical contiguity. Doing this causes some
problems, but they are problems we currently have (e.g. drivers
that _do_ care about physical contiguity being unable to allocate
physical contiguous space can no longer have physical memory
defragged for them to make a large enough contiguous region
available -- we don't defrag at all, today), since you will be
carrying around physical instead of virtual addresses for your
allocations, and ptov'ing them for kernel use, instead of vtop'ing
them for driver use.

It wouldn't take as much study as it would take a hell of a lot
of work.

-- Terry

Julian Elischer

Jul 15, 2001, 4:26:12 AM
Matt Dillon wrote:

>
> I took a look at the paper Leo pointed out to me:
>
> http://www.psc.edu/networking/auto.html
>
> It's a very interesting paper, and the graphs do in fact show the type
> of instability that can occur. The code is a mess, though. I think it
> is possible to generate a much less noisy patch set by taking a higher
> level approach to solving the problem.

there are a couple of problems that can occur:
Imagine the following scenario..

A machine (with a fixed buffer size) is transmitting at N bps on average,
but occasionally cannot send because the window is less than that
needed for continuous sending. Because of that, an intermediate queue
does not overflow..

Now we add adjustable buffer sizes.. and suddenly we are overflowing the
intermediate queue and dropping packets. Since we don't have SACK, we
are resending lots of data and dropping back the window size at regular
intervals. Thus it is possible that under some situations the adjustable
buffer size may result in WORSE throughput.
That brings up one thing I never liked about the current TCP,
which is that we need to keep testing the upper window size to ensure
that we notice if the bandwidth increases. Unfortunately the only way we
can do this is by increasing the window size, until we lose a packet
(again).

There was an interesting paper that explored loss-avoidance techniques.
These included noticing the increased latency that can occur when
an intermediate node starts to become overloaded. Unfortunately,
usually we are not the ones overloading it, so us backing off
doesn't help a lot in many cases. I did some work at Whistle
trying to predict and control remote congestion, but it was mostly
useful when the slowest link was your local loop; it didn't help much
if the slow link was farther away.
Still, it did allow interactive sessions to run in parallel with bulk
sessions and still get reasonable reaction times. Basically I metered
out the ACKs going the other way (out) in order to minimise the
incoming queue size at the remote end of the incoming link. :-)

This is all getting a bit far from the original topic, but
I do worry that we may increase our packet loss with variable buffers
and thus reduce throughput in the cases where the fixed buffer was
getting 80% or so of the theoretical throughput.

julian

>
> -Matt

--
+------------------------------------+ ______ _ __
| __--_|\ Julian Elischer | \ U \/ / hard at work in
| / \ jul...@elischer.org +------>x USA \ a very strange
| ( OZ ) \___ ___ | country !
+- X_.---._/ presently in San Francisco \_/ \\
v

Matt Dillon

Jul 15, 2001, 5:45:36 AM
Ok, here is a patch set that tries to adjust the transmit congestion
window and socket buffer space according to the bandwidth delay product
of the link. THIS PATCH IS AGAINST STABLE!

I make calculations based on bandwidth and round-trip-time. I spent
a lot of time trying to write an algorithm that just used one or the
other, but it turns out that bandwidth is only a stable metric when
you are reducing the window, and rtt is only a stable metric when
you are increasing the window.

The algorithm is basically: decrease the window until we notice
that the throughput is going down, then increase the window until we
notice the RTT is going up (indicating buffering in the network).
However, it took quite a few hours for me to find something that
worked across a wide range of bandwidths and pipe delays. I had to
deal with oscillations at high bandwidths, instability with the
metrics being used in certain situations, and calculation overshoot
and undershoot due to averaging. The biggest breakthrough occurred when
I stopped trying to time the code based on each ack coming back but
instead timed it based on the round-trip-time interval (using the rtt
calculation to trigger the windowing code).
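
The loop can be sketched roughly like this (step sizes and thresholds
here are invented for illustration; the real logic is in the patch
below):

static long
adjust_sendwin(long win, long bw, long prev_bw,
    long rtt, long min_rtt, long win_min)
{
	if (bw < prev_bw)
		win += win / 8;		/* throughput fell: we shrank too far */
	else if (rtt > min_rtt + min_rtt / 8)
		win -= win / 8;		/* rtt rising: queueing in the network */
	else
		win -= win / 16;	/* otherwise keep probing downward */
	return (win < win_min ? win_min : win);
}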

I used dummynet (aka 'ipfw pipe') as well as my LAN and two T1's to
test it.

sysctl's:

net.inet.tcp.tcp_send_dynamic_enable

0 - disabled (old behavior) (default)
1 - enabled, no debugging output
2 - enabled, debug output to console (only really useful when
testing one or two connections).

net.inet.tcp.tcp_send_dynamic_min

min buffering (4096 default)

This parameter specifies the absolute smallest buffer size the
dynamic windowing code will go down to. The default is 4096 bytes.
You may want to set this to 4096 or 8192 to avoid degenerate
conditions on very high speed networks, or if you want to enforce
a minimum amount of socket buffering.
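
For anyone trying the patch, exercising the knobs above looks like any
other sysctl (values illustrative):

	sysctl -w net.inet.tcp.tcp_send_dynamic_enable=2
	sysctl -w net.inet.tcp.tcp_send_dynamic_min=8192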


I got some pretty awesome results when I tested it... I was able to
create a really slow, low bandwidth dummynet link, start a transfer
that utilizes 100% of the bandwidth, and I could still type in another
xterm window that went through the same dummynet. There are immediate
uses for something like this for people who have modem links, not
to mention many other reasons.

-Matt

Index: kern/uipc_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.68.2.16
diff -u -r1.68.2.16 uipc_socket.c
--- kern/uipc_socket.c 2001/06/14 20:46:06 1.68.2.16
+++ kern/uipc_socket.c 2001/07/13 04:05:38
@@ -519,12 +519,44 @@
snderr(so->so_proto->pr_flags & PR_CONNREQUIRED ?
ENOTCONN : EDESTADDRREQ);
}
- space = sbspace(&so->so_snd);
+
+ /*
+ * Calculate the optimal write-buffer size and then reduce
+ * by the amount already in use. Special handling is required
+ * to ensure that atomic writes still work as expected.
+ *
+ * Note: pru_sendpipe() only returns the optimal transmission
+ * pipe size, which is roughly equivalent to what can be
+ * transmitted and unacked. To avoid excessive process
+ * wakeups we double the returned value for our recommended
+ * buffer size.
+ */
+ if (so->so_proto->pr_usrreqs->pru_sendpipe == NULL) {
+ space = sbspace(&so->so_snd);
+ } else {
+ space = (*so->so_proto->pr_usrreqs->pru_sendpipe)(so) * 2;
+ if (atomic && space < resid + clen)
+ space = resid + clen;
+ if (space < so->so_snd.sb_lowat)
+ space = so->so_snd.sb_lowat;
+ if (space > so->so_snd.sb_hiwat)
+ space = so->so_snd.sb_hiwat;
+ space = sbspace_using(&so->so_snd, space);
+ }
+
if (flags & MSG_OOB)
space += 1024;
+
+ /*
+ * Error out if the request is impossible to satisfy.
+ */
if ((atomic && resid > so->so_snd.sb_hiwat) ||
clen > so->so_snd.sb_hiwat)
snderr(EMSGSIZE);
+
+ /*
+ * Block if necessary.
+ */
if (space < resid + clen && uio &&
(atomic || space < so->so_snd.sb_lowat || space < clen)) {
if (so->so_state & SS_NBIO)
@@ -537,6 +569,7 @@
goto restart;
}
splx(s);
+
mp = &top;
space -= clen;
do {
Index: kern/uipc_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/uipc_usrreq.c,v
retrieving revision 1.54.2.5
diff -u -r1.54.2.5 uipc_usrreq.c
--- kern/uipc_usrreq.c 2001/03/05 13:09:01 1.54.2.5
+++ kern/uipc_usrreq.c 2001/07/13 03:56:02
@@ -427,7 +427,7 @@
uipc_connect2, pru_control_notsupp, uipc_detach, uipc_disconnect,
uipc_listen, uipc_peeraddr, uipc_rcvd, pru_rcvoob_notsupp,
uipc_send, uipc_sense, uipc_shutdown, uipc_sockaddr,
- sosend, soreceive, sopoll
+ sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

/*
Index: net/raw_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/net/raw_usrreq.c,v
retrieving revision 1.18
diff -u -r1.18 raw_usrreq.c
--- net/raw_usrreq.c 1999/08/28 00:48:28 1.18
+++ net/raw_usrreq.c 2001/07/13 03:56:12
@@ -296,5 +296,5 @@
pru_connect2_notsupp, pru_control_notsupp, raw_udetach,
raw_udisconnect, pru_listen_notsupp, raw_upeeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, raw_usend, pru_sense_null, raw_ushutdown,
- raw_usockaddr, sosend, soreceive, sopoll
+ raw_usockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};
Index: net/rtsock.c
===================================================================
RCS file: /home/ncvs/src/sys/net/rtsock.c,v
retrieving revision 1.44.2.4
diff -u -r1.44.2.4 rtsock.c
--- net/rtsock.c 2001/07/11 09:37:37 1.44.2.4
+++ net/rtsock.c 2001/07/13 03:56:16
@@ -266,7 +266,7 @@
pru_connect2_notsupp, pru_control_notsupp, rts_detach, rts_disconnect,
pru_listen_notsupp, rts_peeraddr, pru_rcvd_notsupp, pru_rcvoob_notsupp,
rts_send, pru_sense_null, rts_shutdown, rts_sockaddr,
- sosend, soreceive, sopoll
+ sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

/*ARGSUSED*/
Index: netatalk/ddp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netatalk/ddp_usrreq.c,v
retrieving revision 1.17
diff -u -r1.17 ddp_usrreq.c
--- netatalk/ddp_usrreq.c 1999/04/27 12:21:14 1.17
+++ netatalk/ddp_usrreq.c 2001/07/13 03:56:25
@@ -581,5 +581,6 @@
at_setsockaddr,
sosend,
soreceive,
- sopoll
+ sopoll,
+ pru_sendpipe_notsupp
};
Index: netatm/atm_aal5.c
===================================================================
RCS file: /home/ncvs/src/sys/netatm/atm_aal5.c,v
retrieving revision 1.6
diff -u -r1.6 atm_aal5.c
--- netatm/atm_aal5.c 1999/10/09 23:24:59 1.6
+++ netatm/atm_aal5.c 2001/07/13 03:56:40
@@ -101,7 +101,8 @@
atm_aal5_sockaddr, /* pru_sockaddr */
sosend, /* pru_sosend */
soreceive, /* pru_soreceive */
- sopoll /* pru_sopoll */
+ sopoll, /* pru_sopoll */
+ pru_sendpipe_notsupp /* pru_sendpipe */
};
#endif

Index: netatm/atm_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netatm/atm_usrreq.c,v
retrieving revision 1.6
diff -u -r1.6 atm_usrreq.c
--- netatm/atm_usrreq.c 1999/08/28 00:48:39 1.6
+++ netatm/atm_usrreq.c 2001/07/13 03:58:57
@@ -73,6 +73,10 @@
pru_sense_null, /* pru_sense */
atm_proto_notsupp1, /* pru_shutdown */
atm_proto_notsupp3, /* pru_sockaddr */
+ NULL, /* pru_sosend */
+ NULL, /* pru_soreceive */
+ NULL, /* pru_sopoll */
+ pru_sendpipe_notsupp /* pru_sendpipe */
};
#endif

Index: netgraph/ng_socket.c
===================================================================
RCS file: /home/ncvs/src/sys/netgraph/ng_socket.c,v
retrieving revision 1.11.2.3
diff -u -r1.11.2.3 ng_socket.c
--- netgraph/ng_socket.c 2001/02/02 11:59:27 1.11.2.3
+++ netgraph/ng_socket.c 2001/07/13 03:59:30
@@ -907,7 +907,8 @@
ng_setsockaddr,
sosend,
soreceive,
- sopoll
+ sopoll,
+ pru_sendpipe_notsupp
};

static struct pr_usrreqs ngd_usrreqs = {
@@ -930,7 +931,8 @@
ng_setsockaddr,
sosend,
soreceive,
- sopoll
+ sopoll,
+ pru_sendpipe_notsupp
};

/*
Index: netinet/ip_divert.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/ip_divert.c,v
retrieving revision 1.42.2.3
diff -u -r1.42.2.3 ip_divert.c
--- netinet/ip_divert.c 2001/02/27 09:41:15 1.42.2.3
+++ netinet/ip_divert.c 2001/07/13 03:59:47
@@ -540,5 +540,5 @@
pru_connect_notsupp, pru_connect2_notsupp, in_control, div_detach,
div_disconnect, pru_listen_notsupp, in_setpeeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, div_send, pru_sense_null, div_shutdown,
- in_setsockaddr, sosend, soreceive, sopoll
+ in_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};
Index: netinet/raw_ip.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/raw_ip.c,v
retrieving revision 1.64.2.6
diff -u -r1.64.2.6 raw_ip.c
--- netinet/raw_ip.c 2001/07/03 11:01:46 1.64.2.6
+++ netinet/raw_ip.c 2001/07/13 03:59:56
@@ -680,5 +680,5 @@
pru_connect2_notsupp, in_control, rip_detach, rip_disconnect,
pru_listen_notsupp, in_setpeeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, rip_send, pru_sense_null, rip_shutdown,
- in_setsockaddr, sosend, soreceive, sopoll
+ in_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};
Index: netinet/tcp_input.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_input.c,v
retrieving revision 1.107.2.15
diff -u -r1.107.2.15 tcp_input.c
--- netinet/tcp_input.c 2001/07/08 02:21:43 1.107.2.15
+++ netinet/tcp_input.c 2001/07/15 09:23:07
@@ -132,6 +132,14 @@
&drop_synfin, 0, "Drop TCP packets with SYN+FIN set");
#endif

+int tcp_send_dynamic_enable = 0;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, tcp_send_dynamic_enable, CTLFLAG_RW,
+ &tcp_send_dynamic_enable, 0, "enable dynamic control of sendspace");
+int tcp_send_dynamic_min = 4096;
+SYSCTL_INT(_net_inet_tcp, OID_AUTO, tcp_send_dynamic_min, CTLFLAG_RW,
+ &tcp_send_dynamic_min, 0, "set minimum dynamic buffer space");
+
+
struct inpcbhead tcb;
#define tcb6 tcb /* for KAME src sync over BSD*'s */
struct inpcbinfo tcbinfo;
@@ -142,8 +150,9 @@
struct tcphdr *, struct mbuf *, int));
static int tcp_reass __P((struct tcpcb *, struct tcphdr *, int *,
struct mbuf *));
-static void tcp_xmit_timer __P((struct tcpcb *, int));
+static void tcp_xmit_timer __P((struct tcpcb *, int, tcp_seq));
static int tcp_newreno __P((struct tcpcb *, struct tcphdr *));
+static void tcp_ack_dynamic_cwnd(struct tcpcb *tp, struct socket *so);

/* Neighbor Discovery, Neighbor Unreachability Detection Upper layer hint. */
#ifdef INET6
@@ -931,12 +940,16 @@
tp->snd_nxt = tp->snd_max;
tp->t_badrxtwin = 0;
}
- if ((to.to_flag & TOF_TS) != 0)
- tcp_xmit_timer(tp,
- ticks - to.to_tsecr + 1);
- else if (tp->t_rtttime &&
- SEQ_GT(th->th_ack, tp->t_rtseq))
- tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+ /*
+ * note: do not include a sequence number
+ * for anything but t_rtttime timings, see
+ * tcp_xmit_timer().
+ */
+ if (tp->t_rtttime &&
+ SEQ_GT(th->th_ack, tp->t_rtseq))
+ tcp_xmit_timer(tp, tp->t_rtttime, tp->t_rtseq);
+ else if ((to.to_flag & TOF_TS) != 0)
+ tcp_xmit_timer(tp, to.to_tsecr - 1, 0);
acked = th->th_ack - tp->snd_una;
tcpstat.tcps_rcvackpack++;
tcpstat.tcps_rcvackbyte += acked;
@@ -1927,11 +1940,14 @@
* Since we now have an rtt measurement, cancel the
* timer backoff (cf., Phil Karn's retransmit alg.).
* Recompute the initial retransmit timer.
+ *
+ * note: do not include a sequence number for anything
+ * but t_rtttime timings, see tcp_xmit_timer().
*/
- if (to.to_flag & TOF_TS)
- tcp_xmit_timer(tp, ticks - to.to_tsecr + 1);
- else if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
- tcp_xmit_timer(tp, ticks - tp->t_rtttime);
+ if (tp->t_rtttime && SEQ_GT(th->th_ack, tp->t_rtseq))
+ tcp_xmit_timer(tp, tp->t_rtttime, tp->t_rtseq);
+ else if (to.to_flag & TOF_TS)
+ tcp_xmit_timer(tp, to.to_tsecr - 1, 0);

/*
* If all outstanding data is acked, stop retransmit
@@ -1955,25 +1971,40 @@

/*
* When new data is acked, open the congestion window.
- * If the window gives us less than ssthresh packets
- * in flight, open exponentially (maxseg per packet).
- * Otherwise open linearly: maxseg per window
- * (maxseg^2 / cwnd per packet).
- */
- {
- register u_int cw = tp->snd_cwnd;
- register u_int incr = tp->t_maxseg;
-
- if (cw > tp->snd_ssthresh)
- incr = incr * incr / cw;
- /*
+ * We no longer use ssthresh because it just does not work
+ * right. Instead we try to avoid packet loss altogether
+ * by avoiding excessive buffering of packet data in the
+ * network.
+ *
* If t_dupacks != 0 here, it indicates that we are still
* in NewReno fast recovery mode, so we leave the congestion
* window alone.
*/
- if (tcp_do_newreno == 0 || tp->t_dupacks == 0)
- tp->snd_cwnd = min(cw + incr,TCP_MAXWIN<<tp->snd_scale);
+
+ if (tcp_do_newreno == 0 || tp->t_dupacks == 0) {
+ if (tp->t_txbandwidth && tcp_send_dynamic_enable) {
+ tcp_ack_dynamic_cwnd(tp, so);
+ } else {
+ int incr = tp->t_maxseg;
+ if (tp->snd_cwnd > tp->snd_ssthresh)
+ incr = incr * incr / tp->snd_cwnd;
+ tp->snd_cwnd += incr;
+ }
+ /*
+ * Enforce the minimum and maximum congestion window.
+ * Remember, this whole section is hit when we get a
+ * good ack so our window is at least 2 packets.
+ */
+ if (tp->snd_cwnd > (TCP_MAXWIN << tp->snd_scale))
+ tp->snd_cwnd = TCP_MAXWIN << tp->snd_scale;
+ if (tp->snd_cwnd < tp->t_maxseg * 2)
+ tp->snd_cwnd = tp->t_maxseg * 2;
}
+
+ /*
+ * Clean out buffered transmit data that we no longer need
+ * to keep around.
+ */
if (acked > so->so_snd.sb_cc) {
tp->snd_wnd -= so->so_snd.sb_cc;
sbdrop(&so->so_snd, (int)so->so_snd.sb_cc);
@@ -2531,19 +2562,135 @@
panic("tcp_pulloutofband");
}

+/*
+ * Dynamically adjust the congestion window. The sweet spot is slightly
+ * higher then the point where the bandwidth begins to degrade. Beyond
+ * that and the extra packets wind up being buffered in the network.
+ *
+ * We use an assymetric algorithm. We increase the window until we see
+ * a 5% increase the round-trip-time (SRTT). We then assume that this is
+ * the saturation point and decrease the window until we see a loss in
+ * bandwidth.
+ *
+ * This routine is master-timed off the round-trip time of the packet,
+ * allowing us to count round trips. Since bandwidth changes need at
+ * least an rtt cycle to occur, this is much better then counting packets
+ * and should be independant of bandwidth, pipe size, etc...
+ */
+
+#define CWND_COUNT_START 2*1
+#define CWND_COUNT_DECR 2*3
+#define CWND_COUNT_INCR (CWND_COUNT_DECR + 2*8)
+#define CWND_COUNT_STABILIZED (CWND_COUNT_INCR + 2*4)
+#define CWND_COUNT_IMPROVING (CWND_COUNT_STABILIZED + 2*2)
+#define CWND_COUNT_NOT_IMPROVING (CWND_COUNT_IMPROVING + 2*8)
+
+static void
+tcp_ack_dynamic_cwnd(struct tcpcb *tp, struct socket *so)
+{
+ /*
+ * Make adjustments only at every complete round trip.
+ */
+ if ((tp->t_txbwcount & 1) == 0)
+ return;
+ ++tp->t_txbwcount;
+ if (tp->t_txbwcount == CWND_COUNT_START) {
+ /*
+ * Set a rtt performance loss target of 20%
+ */
+ tp->t_last_txbandwidth = tp->t_srtt + tp->t_srtt / 5;
+ } else if (tp->t_txbwcount >= CWND_COUNT_DECR &&
+ tp->t_txbwcount < CWND_COUNT_INCR &&
+ tp->t_srtt < tp->t_last_txbandwidth) {
+ /*
+ * Increase cwnd in maxseg chunks until we hit our target.
+ * The target represents the point where packets are starting
+ * to be buffered significantly in the network.
+ */
+ tp->snd_cwnd += tp->t_maxseg;
+ tp->t_txbwcount = CWND_COUNT_START;
+
+ /*
+ * snap target, required to avoid oscillation at high
+ * bandwidths
+ */
+ if (tp->t_last_txbandwidth > tp->t_srtt + tp->t_srtt / 5)
+ tp->t_last_txbandwidth = tp->t_srtt + tp->t_srtt / 5;
+ /*
+ * Switch directions if we hit the top.
+ */
+ if (tp->snd_cwnd >= so->so_snd.sb_hiwat ||
+ tp->snd_cwnd >= (TCP_MAXWIN << tp->snd_scale)) {
+ tp->snd_cwnd = min(so->so_snd.sb_hiwat, (TCP_MAXWIN << tp->snd_scale));
+ tp->t_txbwcount = CWND_COUNT_INCR - 2;
+ }
+ } else if (tp->t_txbwcount == CWND_COUNT_INCR) {
+ /*
+ * We hit our rtt degradation target. Do nothing (wait until
+ * we stabilize).
+ */
+ } else if (tp->t_txbwcount == CWND_COUNT_STABILIZED) {
+ /*
+ * srtt started to go up, we are at the pipe limit and
+ * must be at the maximum bandwidth. Reduce the window
+ * size until we lose 5% of our bandwidth. Use smaller
+ * chunks to avoid overshooting.
+ */
+ tp->t_last_txbandwidth = tp->t_txbandwidth - tp->t_txbandwidth / 20;
+ tp->snd_cwnd -= tp->t_maxseg / 3;
+ } else if (tp->t_txbwcount >= CWND_COUNT_IMPROVING &&
+ tp->t_txbandwidth > tp->t_last_txbandwidth) {
+ /*
+ * Bandwidth is still holding above our reduced target, so
+ * shrink the window again and loop this state. Use
+ * maxseg / 3 chunks to keep each step below the
+ * noise.
+ */
+ tp->snd_cwnd -= tp->t_maxseg / 3;
+
+ /*
+ * snap target, required to avoid oscillation at high
+ * bandwidths
+ */
+ tp->t_txbwcount = CWND_COUNT_STABILIZED;
+ if (tp->t_last_txbandwidth < tp->t_txbandwidth - tp->t_txbandwidth / 20)
+ tp->t_last_txbandwidth = tp->t_txbandwidth - tp->t_txbandwidth / 20;
+ /*
+ * Switch directions if we hit bottom.
+ */
+ if (tp->snd_cwnd < tcp_send_dynamic_min ||
+ tp->snd_cwnd <= tp->t_maxseg * 2) {
+ tp->snd_cwnd = max(tcp_send_dynamic_min, tp->t_maxseg);
+ tp->t_txbwcount = 0;
+ }
+ } else if (tp->t_txbwcount >= CWND_COUNT_NOT_IMPROVING) {
+ /*
+ * No improvement, so start upward again and loop to
+ * recalculate the -5%. We can recalculate immediately and do
+ * not require additional stabilization time.
+ */
+ tp->snd_cwnd += tp->t_maxseg / 2;
+ tp->t_txbwcount = 0;
+ }
+}
+
/*
- * Collect new round-trip time estimate
- * and update averages and current timeout.
+ * Collect new round-trip time estimate and update averages, current timeout,
+ * and transmit bandwidth.
*/
static void
-tcp_xmit_timer(tp, rtt)
+tcp_xmit_timer(tp, rtttime, rtseq)
register struct tcpcb *tp;
- int rtt;
+ int rtttime;
+ tcp_seq rtseq;
{
- register int delta;
+ int delta;
+ int rtt;

tcpstat.tcps_rttupdated++;
tp->t_rttupdated++;
+
+ rtt = ticks - rtttime;
if (tp->t_srtt != 0) {
/*
* srtt is stored as fixed point with 5 bits after the
@@ -2582,8 +2729,30 @@
tp->t_srtt = rtt << TCP_RTT_SHIFT;
tp->t_rttvar = rtt << (TCP_RTTVAR_SHIFT - 1);
}
- tp->t_rtttime = 0;
tp->t_rxtshift = 0;
+
+ /*
+ * Calculate the transmit-side throughput, in bytes/sec. This is
+ * used to dynamically size the congestion window to the pipe. We
+ * average over 2 packets only. rtseq is only passed for t_rtttime
+ * based timings, which in turn only occur on an interval close to
+ * the round trip time of the packet. We have to do this in order
+ * to get accurate bandwidths without having to take a long term
+ * average, which blows up the dynamic windowing algorithm.
+ */
+ if (rtseq && rtt) {
+ tp->t_rtttime = 0;
+ if (tp->t_last_rtseq) {
+ int bw;
+
+ bw = (rtseq - tp->t_last_rtseq) * hz / rtt;
+ bw = (tp->t_txbandwidth + bw) / 2;
+ tp->t_txbandwidth = bw;
+ tp->t_txbwcount |= 1;
+ }
+ tp->t_last_rtseq = rtseq;
+ tp->t_last_rtttime = rtttime;
+ }

/*
* the retransmit should happen at rtt + 4 * rttvar.
Index: netinet/tcp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.51.2.7
diff -u -r1.51.2.7 tcp_usrreq.c
--- netinet/tcp_usrreq.c 2001/07/08 02:21:44 1.51.2.7
+++ netinet/tcp_usrreq.c 2001/07/15 05:31:52
@@ -494,6 +494,47 @@
}

/*
+ * Calculate the optimal transmission pipe size. This is used to limit the
+ * amount of data we allow to be buffered in order to reduce memory use,
+ * allowing connections to dynamically adjust to the bandwidth product of
+ * their links.
+ *
+ * For tcp we return approximately the congestion window size, which
+ * winds up being the bandwidth delay product in a lossless environment.
+ */
+static int
+tcp_usr_sendpipe(struct socket *so)
+{
+ struct inpcb *inp;
+ int size = so->so_snd.sb_hiwat;
+
+ if (tcp_send_dynamic_enable && (inp = sotoinpcb(so)) != NULL) {
+ struct tcpcb *tp;
+
+ if ((tp = intotcpcb(inp)) != NULL) {
+ size = tp->snd_cwnd;
+ if (size > tp->snd_wnd)
+ size = tp->snd_wnd;
+
+ /*
+ * debugging & minimum transmit buffer availability
+ */
+ if (tcp_send_dynamic_enable > 1) {
+ static int last_hz;
+
+ if (last_hz != ticks / hz) {
+ last_hz = ticks / hz;
+ printf("tcp_user_sendpipe: size=%d bw=%d lbw=%d count=%d srtt=%d\n", size, tp->t_txbandwidth, tp->t_last_txbandwidth, tp->t_txbwcount, tp->t_srtt);
+ }
+ }
+ if (size < tcp_send_dynamic_min)
+ size = tcp_send_dynamic_min;
+ }
+ }
+ return(size);
+}
+
+/*
* Do a send by putting data in output queue and updating urgent
* marker if URG set. Possibly send more data. Unlike the other
* pru_*() routines, the mbuf chains are our responsibility. We
@@ -674,7 +715,7 @@
tcp_usr_connect, pru_connect2_notsupp, in_control, tcp_usr_detach,
tcp_usr_disconnect, tcp_usr_listen, in_setpeeraddr, tcp_usr_rcvd,
tcp_usr_rcvoob, tcp_usr_send, pru_sense_null, tcp_usr_shutdown,
- in_setsockaddr, sosend, soreceive, sopoll
+ in_setsockaddr, sosend, soreceive, sopoll, tcp_usr_sendpipe
};

#ifdef INET6
@@ -683,7 +724,7 @@
tcp6_usr_connect, pru_connect2_notsupp, in6_control, tcp_usr_detach,
tcp_usr_disconnect, tcp6_usr_listen, in6_mapped_peeraddr, tcp_usr_rcvd,
tcp_usr_rcvoob, tcp_usr_send, pru_sense_null, tcp_usr_shutdown,
- in6_mapped_sockaddr, sosend, soreceive, sopoll
+ in6_mapped_sockaddr, sosend, soreceive, sopoll, tcp_usr_sendpipe
};
#endif /* INET6 */

Index: netinet/tcp_var.h
===================================================================
RCS file: /home/ncvs/src/sys/netinet/tcp_var.h,v
retrieving revision 1.56.2.7
diff -u -r1.56.2.7 tcp_var.h
--- netinet/tcp_var.h 2001/07/08 02:21:44 1.56.2.7
+++ netinet/tcp_var.h 2001/07/15 07:25:48
@@ -95,6 +95,7 @@
#define TF_SENDCCNEW 0x08000 /* send CCnew instead of CC in SYN */
#define TF_MORETOCOME 0x10000 /* More data to be appended to sock */
#define TF_LQ_OVERFLOW 0x20000 /* listen queue overflow */
+#define TF_BWSCANUP 0x40000
int t_force; /* 1 if forcing out a byte */

tcp_seq snd_una; /* send unacknowledged */
@@ -128,6 +129,11 @@
u_long t_starttime; /* time connection was established */
int t_rtttime; /* round trip time */
tcp_seq t_rtseq; /* sequence number being timed */
+ int t_last_rtttime;
+ tcp_seq t_last_rtseq; /* sequence number being timed */
+ int t_txbandwidth; /* transmit bandwidth/delay */
+ int t_last_txbandwidth;
+ int t_txbwcount;

int t_rxtcur; /* current retransmit value (ticks) */
u_int t_maxseg; /* maximum segment size */
@@ -371,6 +377,8 @@
extern int tcp_do_newreno;
extern int ss_fltsz;
extern int ss_fltsz_local;
+extern int tcp_send_dynamic_enable;
+extern int tcp_send_dynamic_min;

void tcp_canceltimers __P((struct tcpcb *));
struct tcpcb *
Index: netinet/udp_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet/udp_usrreq.c,v
retrieving revision 1.64.2.11
diff -u -r1.64.2.11 udp_usrreq.c
--- netinet/udp_usrreq.c 2001/07/03 11:01:47 1.64.2.11
+++ netinet/udp_usrreq.c 2001/07/13 04:00:17
@@ -923,6 +923,6 @@
pru_connect2_notsupp, in_control, udp_detach, udp_disconnect,
pru_listen_notsupp, in_setpeeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, udp_send, pru_sense_null, udp_shutdown,
- in_setsockaddr, sosend, soreceive, sopoll
+ in_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

Index: netinet6/raw_ip6.c
===================================================================
RCS file: /home/ncvs/src/sys/netinet6/raw_ip6.c,v
retrieving revision 1.7.2.3
diff -u -r1.7.2.3 raw_ip6.c
--- netinet6/raw_ip6.c 2001/07/03 11:01:55 1.7.2.3
+++ netinet6/raw_ip6.c 2001/07/13 04:00:25
@@ -733,5 +733,5 @@
pru_connect2_notsupp, in6_control, rip6_detach, rip6_disconnect,
pru_listen_notsupp, in6_setpeeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, rip6_send, pru_sense_null, rip6_shutdown,
- in6_setsockaddr, sosend, soreceive, sopoll
+ in6_setsockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};
Index: netipx/ipx_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netipx/ipx_usrreq.c,v
retrieving revision 1.26.2.1
diff -u -r1.26.2.1 ipx_usrreq.c
--- netipx/ipx_usrreq.c 2001/02/22 09:44:18 1.26.2.1
+++ netipx/ipx_usrreq.c 2001/07/13 04:00:38
@@ -89,7 +89,7 @@
ipx_connect, pru_connect2_notsupp, ipx_control, ipx_detach,
ipx_disconnect, pru_listen_notsupp, ipx_peeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, ipx_send, pru_sense_null, ipx_shutdown,
- ipx_sockaddr, sosend, soreceive, sopoll
+ ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

struct pr_usrreqs ripx_usrreqs = {
@@ -97,7 +97,7 @@
ipx_connect, pru_connect2_notsupp, ipx_control, ipx_detach,
ipx_disconnect, pru_listen_notsupp, ipx_peeraddr, pru_rcvd_notsupp,
pru_rcvoob_notsupp, ipx_send, pru_sense_null, ipx_shutdown,
- ipx_sockaddr, sosend, soreceive, sopoll
+ ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

/*
Index: netipx/spx_usrreq.c
===================================================================
RCS file: /home/ncvs/src/sys/netipx/spx_usrreq.c,v
retrieving revision 1.27.2.1
diff -u -r1.27.2.1 spx_usrreq.c
--- netipx/spx_usrreq.c 2001/02/22 09:44:18 1.27.2.1
+++ netipx/spx_usrreq.c 2001/07/13 04:00:46
@@ -107,7 +107,7 @@
spx_connect, pru_connect2_notsupp, ipx_control, spx_detach,
spx_usr_disconnect, spx_listen, ipx_peeraddr, spx_rcvd,
spx_rcvoob, spx_send, pru_sense_null, spx_shutdown,
- ipx_sockaddr, sosend, soreceive, sopoll
+ ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

struct pr_usrreqs spx_usrreq_sps = {
@@ -115,7 +115,7 @@
spx_connect, pru_connect2_notsupp, ipx_control, spx_detach,
spx_usr_disconnect, spx_listen, ipx_peeraddr, spx_rcvd,
spx_rcvoob, spx_send, pru_sense_null, spx_shutdown,
- ipx_sockaddr, sosend, soreceive, sopoll
+ ipx_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

void
Index: netkey/keysock.c
===================================================================
RCS file: /home/ncvs/src/sys/netkey/keysock.c,v
retrieving revision 1.1.2.2
diff -u -r1.1.2.2 keysock.c
--- netkey/keysock.c 2001/07/03 11:02:00 1.1.2.2
+++ netkey/keysock.c 2001/07/13 04:00:51
@@ -586,7 +586,7 @@
key_disconnect, pru_listen_notsupp, key_peeraddr,
pru_rcvd_notsupp,
pru_rcvoob_notsupp, key_send, pru_sense_null, key_shutdown,
- key_sockaddr, sosend, soreceive, sopoll
+ key_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

/* sysctl */
Index: netnatm/natm.c
===================================================================
RCS file: /home/ncvs/src/sys/netnatm/natm.c,v
retrieving revision 1.12
diff -u -r1.12 natm.c
--- netnatm/natm.c 2000/02/13 03:32:03 1.12
+++ netnatm/natm.c 2001/07/13 04:01:15
@@ -413,7 +413,7 @@
natm_usr_detach, natm_usr_disconnect, pru_listen_notsupp,
natm_usr_peeraddr, pru_rcvd_notsupp, pru_rcvoob_notsupp,
natm_usr_send, pru_sense_null, natm_usr_shutdown,
- natm_usr_sockaddr, sosend, soreceive, sopoll
+ natm_usr_sockaddr, sosend, soreceive, sopoll, pru_sendpipe_notsupp
};

#else /* !FREEBSD_USRREQS */
Index: sys/protosw.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/protosw.h,v
retrieving revision 1.28.2.2
diff -u -r1.28.2.2 protosw.h
--- sys/protosw.h 2001/07/03 11:02:01 1.28.2.2
+++ sys/protosw.h 2001/07/13 04:02:15
@@ -228,6 +228,7 @@
struct mbuf **controlp, int *flagsp));
int (*pru_sopoll) __P((struct socket *so, int events,
struct ucred *cred, struct proc *p));
+ int (*pru_sendpipe) __P((struct socket *so));
};

int pru_accept_notsupp __P((struct socket *so, struct sockaddr **nam));
@@ -240,6 +241,7 @@
int pru_rcvd_notsupp __P((struct socket *so, int flags));
int pru_rcvoob_notsupp __P((struct socket *so, struct mbuf *m, int flags));
int pru_sense_null __P((struct socket *so, struct stat *sb));
+#define pru_sendpipe_notsupp NULL

#endif /* _KERNEL */

Index: sys/socketvar.h
===================================================================
RCS file: /home/ncvs/src/sys/sys/socketvar.h,v
retrieving revision 1.46.2.5
diff -u -r1.46.2.5 socketvar.h
--- sys/socketvar.h 2001/02/26 04:23:21 1.46.2.5
+++ sys/socketvar.h 2001/07/13 03:47:25
@@ -188,9 +188,11 @@
* still be negative (cc > hiwat or mbcnt > mbmax). Should detect
* overflow and return 0. Should use "lmin" but it doesn't exist now.
*/
-#define sbspace(sb) \
- ((long) imin((int)((sb)->sb_hiwat - (sb)->sb_cc), \
+#define sbspace_using(sb, hiwat) \
+ ((long) imin((int)((hiwat) - (sb)->sb_cc), \
(int)((sb)->sb_mbmax - (sb)->sb_mbcnt)))
+
+#define sbspace(sb) sbspace_using(sb, (sb)->sb_hiwat)

/* do we have to send all at once on a socket? */
#define sosendallatonce(so) \

Tim

unread,
Jul 15, 2001, 7:20:09 AM7/15/01
to
Cool! We were just commenting that it's too bad dummynet/ALTQ really
couldn't help the interactive response for us dial-up users. Anyway, I
just tried this on my dial-up connection on a fresh -STABLE but don't
really notice any appreciable difference.

net.inet.tcp.tcp_send_dynamic_enable: 1
net.inet.tcp.tcp_send_dynamic_min: 1024 (tried it with default 4096 too)

My ssh response is still about 3 or 4 seconds behind my typing. What
should a dial-up user expect?

Thanks!

Tim

Leo Bicknell

unread,
Jul 15, 2001, 10:34:10 AM7/15/01
to
On Sun, Jul 15, 2001 at 01:13:11AM -0700, Julian Elischer wrote:
> This is all getting a bit far from the original topic, but
> I do worry that we may increase our packet loss with variable buffers and thus
> reduce throughput in the cases where the fixed buffer was getting 80%
> or so of the theoretical throughput.

Packet loss is not always a bad thing. Let me use an admittedly
extreme example:

Consider a backup server across the country from the four machines
it's trying to back up nightly. So we have high (let's say 70ms) RTTs,
and let's say for the sake of argument the limiting factor is a
DS-3 in the middle, 45 MBits/sec.

Each connection can get 16384 * 1000 / 70 = 234057 bytes/sec, or
about 1.87 Mbits/sec. Multiply by the 4 machines, and we get
network utilization of 7.48 Mbits/sec, about 16% of the DS-3.
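
If you want to play with the arithmetic yourself, here is the same
calculation as a trivial stand-alone C program (nothing from the
kernel, just the numbers above):

    #include <stdio.h>

    /*
     * Window-limited TCP throughput: one window per round trip.
     * window is in bytes, rtt in milliseconds; returns bytes/sec.
     */
    static long
    window_limit(long window, long rtt_ms)
    {
            return (window * 1000 / rtt_ms);
    }

    int
    main(void)
    {
            long bps = window_limit(16384, 70); /* 16k window, 70ms rtt */

            printf("%ld bytes/sec, %.2f Mbits/sec per connection\n",
                bps, bps * 8 / 1e6);
            return (0);
    }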

Now, we implement some sort of code that can increase the amount
of socket buffering space. As a result, the window can grow (per
connection) large enough to fill a DS-3, so the 4 hosts must fight
for the bandwidth available.

I don't have any great math for how we get here, but TCP in normal
situations rarely produces more than 5% packet loss (10% absolute
max), since it backs off when congestion occurs. I'll go with 5%
as an upper bound. With that packet loss, TCP now gets the DS-3
much closer to full, let's say 90%, or 40.5 Mbits/sec (it should
be higher than 90%, but again, I'm worst casing). In the aggregate
that will be spread across the 4 connections evenly, or 10.12
Mbits/sec per connection.

The question to be asked is: which is better, 1.87 Mbits/sec with
no packet loss, or 10.12 Mbits/sec with 5% packet loss? Clearly the
latter gives better performance, even with packet loss.

Clearly knowing the end-to-end link bandwidth and 'just' filling it
would be better, but packet loss, at least in the context of TCP
flow control, is not all bad. Something else to remember is that not
everyone plays fair, so if we stay at 80% of the available bandwidth
while everyone else pushes to packet loss, we will in general be pushed out.

--
Leo Bicknell - bick...@ufp.org
Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org

To Unsubscribe: send mail to majo...@FreeBSD.org

Bernd Walter

unread,
Jul 15, 2001, 11:02:18 AM7/15/01
to
On Sun, Jul 15, 2001 at 06:19:15AM -0500, Tim wrote:
> Cool! We were just commenting that it's too bad dummynet/ALTQ really
> couldn't help the interactive response for us dial-up users. Anyway, I
> just tried this on my dial-up connection on a fresh -STABLE but don't
> really notice any appreciable difference.
>
> net.inet.tcp.tcp_send_dynamic_enable: 1
> net.inet.tcp.tcp_send_dynamic_min: 1024 (tried it with default 4096 too)
>
> My ssh response is still about 3 or 4 seconds behind my typing. What
> should a dial-up user expect?

If you don't see a difference with a dial-up line, you are seeing
exactly what is expected from this - which is a good sign.
The situations where it should improve performance are different
from yours.

--
B.Walter COSMO-Project http://www.cosmo-project.de
ti...@cicely.de Usergroup in...@cosmo-project.de

Matt Dillon

unread,
Jul 15, 2001, 12:57:50 PM7/15/01
to

:
:Cool! We were just commenting that it's too bad dummynet/ALTQ really
:couldn't help the interactive response for us dial-up users. Anyway, I
:just tried this on my dial-up connection on a fresh -STABLE but don't
:really notice any appreciable difference.
:
:net.inet.tcp.tcp_send_dynamic_enable: 1
:net.inet.tcp.tcp_send_dynamic_min: 1024 (tried it with default 4096 too)
:
:My ssh response is still about 3 or 4 seconds behind my typing. What
:should a dial-up user expect?
:
:Thanks!
:
:Tim

Well, what this code does is manage the case where you are streaming
data in the transmit direction *and* trying to type at the same time
over another connection. It will not improve latency on an idle
connection that you are typing over. Even in the streaming case with
this algorithm the minimum window is two t_maxseg packets and that
will have a noticeable effect on latency over a dialup no matter what.
What this protocol is supposed to save you from, at least insofar as a
dialup goes, is that it should prevent an upload from killing terminal
performance entirely (e.g. it should prevent 10-20 second latency on
keystrokes).

3-4 seconds of latency over an idle dialup is really bad. I still
get sub-second responsiveness when I run ssh over a dialup. I always
use compression over dialups (ssh -C ...) and it makes a big
difference.

This protocol also tends to devolve into a degenerate small-buffer
case (which is what it is supposed to do) when the connection is
running over a low-bandwidth, high-latency link. It only takes a two-
or three-packet window to fill the link in such cases, and the minimum
is two packets.
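
To put numbers on that: a 33.6 kbps modem with, say, a 150ms round
trip holds about 33600 / 8 * 0.15 = 630 bytes in flight - barely more
than one 576-byte packet - so the two-packet floor already keeps such
a link full. (The modem speed and rtt here are just illustrative.)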

You might be able to improve performance by negotiating a smaller MTU
(if this is a PPP connection), but no matter what you will never do
better than the normal typing performance on an idle link that
you already get.

-Matt

Matt Dillon

unread,
Jul 15, 2001, 1:05:41 PM7/15/01
to

:Packet loss is not always a bad thing. Let me use an admittedly

Well, 4 connections isn't enough to generate packet loss. All
that happens is that routers in between start buffering the packets.
If you had a *huge* tcp window size then the routers in between could
run out of packet space and then packet loss would start to occur.
Routers tend to have a lot of buffer space, though. The real killer
is run-away latencies rather than packet loss.

On the other hand, something like the experimental bandwidth delay
product code I posted would do very well running 4 connections over
such a link, because it would detect the point where the routers
start buffering the data (by noticing the increased latency) and
back off before the packet loss occurred. It doesn't care how
many connections are running in parallel. The downside is that the
algorithm becomes less stable as you increase the number of
connections going between the same two end points. The stability
in the face of lots of parallel connections is something that needs
to be tested.

Also, the algorithm is less helpful when it has to figure out the
optimal transmit buffer size for every new connection (consider a web
server). I am considering ripping out the ssthresh junk from the stack,
which virtually does not work at all, and using the route table's
ssthresh field to set the initial buffer size for the algorithm.
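
Very roughly, and purely as an untested sketch (assuming the existing
rmx_ssthresh route metric and the 4.x structures; none of this is in
the patch I posted), the seeding would look something like:

    /*
     * Untested sketch: seed the dynamic window from the cached
     * per-route ssthresh so a new connection does not start cold.
     * Assumes inp_route is valid at this point.
     */
    static void
    tcp_seed_sendpipe(struct tcpcb *tp)
    {
            struct rtentry *rt = tp->t_inpcb->inp_route.ro_rt;

            if (rt != NULL && rt->rt_rmx.rmx_ssthresh > tp->t_maxseg * 2)
                    tp->snd_cwnd = rt->rt_rmx.rmx_ssthresh;
    }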

Leo Bicknell

unread,
Jul 15, 2001, 1:14:23 PM7/15/01
to
On Sun, Jul 15, 2001 at 10:05:16AM -0700, Matt Dillon wrote:
> Well, 4 connections isn't enough to generate packet loss. All
> that happens is that routers inbetween start buffering the packets.
> If you had a *huge* tcp window size then the routers inbetween could
> run out of packet space and then packet loss would start to occur.
> Routers tend to have a lot of buffer space, though. The real killer
> is run-away latencies rather then packet loss.

Sure it is, in a lot of cases. Keep in mind RED is becoming the
default (in particular, one major router vendor ships with it as
the default now), so in general routers will discard packets _before_
they will buffer them.

> Also, the algorithm is less helpful when it has to figure out the
> optimal transmit buffer size for every new connection (consider a web
> server). I am considering ripping out the ssthresh junk from the stack,
> which does not work virtually at all, and using the route table's
> ssthresh field to set the initial buffer size for the algorithm.

This would probably be a big win, as in web-server type cases there are
many small connections back to back.

Now, if you want a really radical idea: how about not doing slow-start
on the second through nth connections, but starting where the previous
connection left off?

Whoa, that's loaded with issues. :-)

--
Leo Bicknell - bick...@ufp.org

Systems Engineer - Internetworking Engineer - CCIE 3440
Read TMBG List - tmbg-lis...@tmbg.org, www.tmbg.org

To Unsubscribe: send mail to majo...@FreeBSD.org

Matt Dillon

unread,
Jul 15, 2001, 1:32:59 PM7/15/01
to

:
:On Sun, Jul 15, 2001 at 10:05:16AM -0700, Matt Dillon wrote:
:> Well, 4 connections isn't enough to generate packet loss. All
:> that happens is that routers inbetween start buffering the packets.
:> If you had a *huge* tcp window size then the routers inbetween could
:> run out of packet space and then packet loss would start to occur.
:> Routers tend to have a lot of buffer space, though. The real killer
:> is run-away latencies rather then packet loss.
:
:Sure it is, in a lot of cases. Keep in mind RED is becoming the
:default (in paritcular one major router vendor ships with it as
:the default now), so in general routers will discard packets _before_
:they will buffer them.

That isn't what RED does, not really. Basically it statistically
drops packets at an ever-increasing rate as the buffer fills up.
It does NOT prevent packet buffering from occurring, and it doesn't
kick in the moment buffering is used -- the buffer has to start to
fill up significantly before RED has an effect.
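
In outline, the classic RED drop decision (a sketch of the
Floyd/Jacobson algorithm, not any particular vendor's implementation;
written as user-space code, hence drand48()) looks like this:

    #include <stdlib.h>

    /*
     * RED keys off the *average* queue length, not the instantaneous
     * one: no drops below min_th, certain drop above max_th, and a
     * linearly increasing drop probability in between.
     */
    static int
    red_should_drop(double avg_qlen, double min_th, double max_th,
        double max_p)
    {
            double p;

            if (avg_qlen < min_th)
                    return (0);     /* queue healthy: never drop */
            if (avg_qlen >= max_th)
                    return (1);     /* queue full: always drop */
            p = max_p * (avg_qlen - min_th) / (max_th - min_th);
            return (drand48() < p); /* probabilistic early drop */
    }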

I personally do not believe that RED has a future or, if it does,
that it will wind up only kicking in when queues would otherwise
start to drop packets anyway, as a fail-safe rather than as a
prime bandwidth management system.

The bandwidth delay product code kicks in instantly - before a
significant number of packets get queued at the router, so it is
effectively under RED's radar. And if packet loss does occur
NewReno takes over again, so the bandwidth delay code will not
interfere with NewReno (or whatever we do to deal with packet loss).

:> Also, the algorithm is less helpful when it has to figure out the
:> optimal transmit buffer size for every new connection (consider a web
:> server). I am considering ripping out the ssthresh junk from the stack,
:> which does not work virtually at all, and using the route table's
:> ssthresh field to set the initial buffer size for the algorithm.
:
:This would probably be a big win, as in web-server type cases there are
:many small connections back to back.
:
:Now, if you want a really radical idea, how about not doing slow-start
:on the second-nth connection, but starting where the previous connection
:left off.
:
:Whoa, that's loaded with issues. :-)
:
:--
:Leo Bicknell - bick...@ufp.org

re: slow-start. Actually I think this would work quite well with
regard to setting the initial buffer size.

-Matt

Matt Dillon

unread,
Jul 15, 2001, 3:31:06 PM7/15/01
to

:Now, we add adjustable queue sizes... and suddenly we are overflowing the
:intermediate
:queue, and dropping packets. Since we don't have SACK we are resending
:lots of data and dropping back the window size at regular intervals. Thus
:it is possible that under some situations the adjustable buffer size
:may result in WORSE throughput.
:That brings up one thing I never liked about the current TCP,
:which is that we need to keep testing the upper window size to ensure that
:we notice if the bandwidth increases. Unfortunately the only way we can
:do this is by
:increasing the window size, until we lose a packet (again).
:
:There was an interesting paper that explored loss-avoidance techniques;
:these included noticing the increased latency that can occur when
:an intermediate node starts to become overloaded. Unfortunately,
:usually we are not the ones overloading it, so us backing off
:doesn't help a lot in many cases. I did some work at Whistle
:trying to predict and control remote congestion, but it was mostly useful
:when the slowest link was your local loop and didn't help much if the
:link was further away.
:Still, it did allow interactive sessions to run in parallel with bulk
:sessions and still get reasonable reaction times. Basically I metered
:out the ACKs going the other way (out) in order to minimise the
:incoming queue size at the remote end of the incoming link. :-)
:
:This is all getting a bit far from the original topic, but
:I do worry that we may increase our packet loss with variable buffers and thus
:reduce throughout in the cases where teh fixed buffer was getting 80%
:or so of the theoretical throughout.
:
:julian

Well, it can't be worse than it is now... now it increases the window
size until it hits the sendspace limit or hits packet loss.

I tried both mechanisms... checking for the bandwidth to plateau
while increasing the window size, which didn't work very well,
and looking for the increased latency, which worked quite nicely.
When decreasing the window size, checking for the latency to bottom
out didn't work very well, but checking for the bandwidth to start to
drop did. The algorithm as posted is still not very stable - I had
to use 5% hysteresis to get anything approaching a reasonable result,
but it shouldn't go off into the weeds either (I hope).
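
Boiled down, the probe loop amounts to this (an illustrative
restatement of tcp_ack_dynamic_cwnd() from the patch, with made-up
names; not a drop-in replacement):

    struct probe {                  /* made-up state, for illustration */
            int dir;                /* 0 = growing cwnd, 1 = shrinking */
            int cwnd, maxseg;
            int srtt, srtt_target;  /* smoothed rtt and its +20% target */
            int bw, bw_target;      /* measured bw and its -5% target */
    };

    static void
    probe_once_per_rtt(struct probe *p)
    {
            if (p->dir == 0) {
                    if (p->srtt < p->srtt_target)
                            p->cwnd += p->maxseg;   /* latency flat: grow */
                    else {
                            /* rtt degraded: the pipe is full, reverse */
                            p->bw_target = p->bw - p->bw / 20;
                            p->dir = 1;
                    }
            } else {
                    if (p->bw > p->bw_target)
                            p->cwnd -= p->maxseg / 3; /* bw holding: shrink */
                    else
                            p->dir = 0;  /* bw fell 5%: overshot, reverse */
            }
    }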

The method definitely works best when the constriction is near either
end of the pipe, i.e. like your DSL line or T1 or modem, or the
destination's DSL line or T1 or modem or whatever. When the
constriction is in the middle of the network I completely agree with
you... the algorithm breaks down. You can still figure it out
statistically, but it takes far too long to remove the noise from the
measurements.

On the other hand, if the routers were able to insert a feedback
metric in the packet (e.g. like ttl but measuring something else),
I think the middle-of-the-network problem could be solved.

Tim

unread,
Jul 15, 2001, 7:14:52 PM7/15/01
to
Ah, I didn't realize that it only affects the transmit end - so I am
guessing it is similar to what ALTQ does?

BTW, I didn't mean to imply that it was an idle link - I saturated the
link with a download in the background while testing. I am also running
an MTU of 576 already.

Note that I could get the effect I want with Dummynet by introducing
probabilistic packet loss on traffic other than interactive traffic, but
it completely kills performance on everything else (not necessarily a bad
thing). One of my colleagues uses Dummynet and allocates 1kb/s to ssh and
that seems to strike a somewhat better balance. He only turns it on when
he's downloading something though - whereas I'd like to find a scenario
where I can leave it on permanently.

It would be nice for us lowly dial-up users to allow some sort of transfer
in the background and still have reasonable interactive performance -
with reasonable total throughput to boot. Maybe it's easier to get that
DSL connection (I would have already, if I didn't hate our ILEC so much).

Thanks,

Tim

Luigi Rizzo

unread,
Jul 16, 2001, 10:27:25 AM7/16/01
to
> Cool! We were just commenting that it's too bad dummynet/ALTQ really
> couldn't help the interactive response for us dial-up users. Anyway, I

I haven't seen the beginning of the thread but surely both altq
and dummynet can help, with the CBQ/WFQ support.

In the case of dummynet, you can pace incoming traffic as well,
at your endpoint. This means you act after the bottleneck,
but the effect is that you will delay ACKs, and so slow down the
connection that is eating a lot of bandwidth; in the steady state
this keeps the queue very short even before the bottleneck.
Much like what products like Packeteer do.

cheers
luigi

Kenneth Wayne Culver

unread,
Jul 16, 2001, 10:43:01 AM7/16/01
to
I have been testing this over a very slow (barely ever over 24000 bps due
to a crappy phone line) dial-up link, and as expected, over an idle line
there is no difference (typing in an interactive ssh session seems a
little quicker, but that could just be me). The gain comes when someone is
downloading over the link and I try to type in an interactive ssh
session (I'm sharing the link with one other computer). Without the sysctl
turned on, typing in the "interactive" session results in a 10-15 second
wait before anything appears on the screen; with the sysctl turned on,
the wait is 2-3 seconds. I'd say that's pretty good work :-)

Ken

Matt Dillon

unread,
Jul 16, 2001, 1:19:47 PM7/16/01
to
:I haven't seen the beginning of the thread but surely both altq
:and dummynet can help, with the CBQ/WFQ support.
:
:In the case of dummynet, you can pace incoming traffic as well,
:at your endpoint. This means you act after the bottleneck,
:but the effect is that this way
:you will delay acks, and so slow down the connection eating a lot of
:bandwidth, and in the steady state this keeps the queue very
:short even before the bottleneck.
:Much like what products like packeteer do.
:
: cheers
: luigi

I don't know much about CBQ (Class Based Queuing) and WFQ (Weighted
Fair Queueing), but my impression is that these protocols would only
affect the transmit side (like the patch I posted) and would also have
to be implemented at the router nodes rather than simply at the
end points. Of course, for a modem your end point *is* a router node
so that would probably work ok.

The patch I posted, implementing bandwidth delay product adjustments
to the transmit window, should work extremely well with a modem or
DSL line (where the bandwidth restriction occurs near the end points
rather than in the middle of the network), but again it only affects
outgoing data. I'm looking at Julian's algorithms to see if there
is a receive side solution I can implement that wouldn't conflict
with the transmit side solution.

-Matt
