
Increase performance of pseudo terminals?


Jef Driesen

Dec 30, 2009, 1:08:26 PM
Hi,

I'm using socat and pseudo terminals, to create two virtual serial ports
linked with a null modem cable. The purpose of this setup is to allow
two applications to communicate with each other, as if they were talking
to real serial ports.

socat PTY,link=/tmp/ttyS0 PTY,link=/tmp/ttyS1

But I noticed the performance is quite bad. Trying to transmit data
takes very long. For instance transmitting 32KB takes a few minutes on
my machine (Ubuntu 9.10), while on another user's machine (Mac OS X)
it's much faster, in the order of a few seconds.

Any idea what is causing this?

I know the transfer protocol is not exactly efficient because it uses a
very small block size (16-byte packets with a few extra bytes for
framing), but that doesn't explain the difference on other machines.
Note that I don't have control over the transfer protocol.

Thanks in advance,

Jef

Alan Curry

Dec 30, 2009, 3:36:19 PM
In article <uGM_m.1242$SK6...@newsfe28.ams2>,

Jef Driesen <jefdr...@hotmail.com.invalid> wrote:
>Hi,
>
>I'm using socat and pseudo terminals, to create two virtual serial ports
>linked with a null modem cable. The purpose of this setup is to allow
>two applications to communicate with each other, as if they were talking
>to real serial ports.
>
>socat PTY,link=/tmp/ttyS0 PTY,link=/tmp/ttyS1
>
>But I noticed the performance is quite bad. Trying to transmit data
>takes very long. For instance transmitting 32KB takes a few minutes on
>my machine (Ubuntu 9.10), while on another user's machine (Mac OS X)
>it's much faster, in the order of a few seconds.
>
>Any idea what is causing this?

termios is echoing everything back across the link. And then echoing it
again and again because both sides are in echo mode. A single byte written
should be enough to keep it busy echoing forever. Try adding ,raw,echo=0 to
the options.

The difference between operating systems might be caused by the kernel's
default termios settings for freshly created ptys.

--
Alan Curry

Jef Driesen

Dec 30, 2009, 5:31:22 PM

My applications change the termios settings into raw mode, so I assume
that problem doesn't apply in my case?
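
(Editorial note: in userspace, 'raw mode' typically amounts to something
like the sketch below. The open_raw() helper and the device path are
illustrative, not the actual application code.)

#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Sketch only: open one end of the socat link and put it in raw mode. */
static int open_raw(const char *path)
{
        struct termios tio;
        int fd = open(path, O_RDWR | O_NOCTTY);

        if (fd < 0)
                return -1;
        if (tcgetattr(fd, &tio) < 0) {
                close(fd);
                return -1;
        }
        cfmakeraw(&tio);        /* clears ICANON, ECHO, ISIG, OPOST, ... */
        if (tcsetattr(fd, TCSANOW, &tio) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}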

David Schwartz

Jan 2, 2010, 8:36:46 PM
On Dec 30 2009, 2:31 pm, Jef Driesen <jefdrie...@hotmail.com.invalid>
wrote:

> My applications change the termios settings into raw mode, so I assume
> that problem doesn't apply in my case?

No, what's happening is simply that there are a ridiculous number of
handoffs where A waits for B. Tracing it on my Linux Fedora 12 system,
I've already identified about 60(!) such handoffs (cases where A does
something and then B must do something for any further forward
progress to be made) that have to occur for each block of 128 bytes to
be sent.

DS

Jef Driesen

Jan 3, 2010, 8:50:03 AM

I'm not really sure I understand what you wrote.

What socat does is a simple select() loop, waiting until data arrives on
one of the two pseudo terminals. When data is available on one side, it
is read(), and then write() back to the other one (and vice versa of
course). I would think the overhead is quite small.
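
(Editorial sketch, not socat's actual source: the select() loop described
above amounts to roughly the following, with error and short-write
handling omitted and the relay() name invented for illustration.)

#include <sys/select.h>
#include <unistd.h>

/* Shuttle bytes between two pty master fds, a and b. */
static void relay(int a, int b)
{
        char buf[8192];
        int maxfd = (a > b ? a : b) + 1;

        for (;;) {
                fd_set rfds;

                FD_ZERO(&rfds);
                FD_SET(a, &rfds);
                FD_SET(b, &rfds);
                if (select(maxfd, &rfds, NULL, NULL, NULL) < 0)
                        break;
                if (FD_ISSET(a, &rfds)) {
                        ssize_t n = read(a, buf, sizeof buf);
                        if (n <= 0)
                                break;
                        write(b, buf, n);       /* short writes ignored */
                }
                if (FD_ISSET(b, &rfds)) {
                        ssize_t n = read(b, buf, sizeof buf);
                        if (n <= 0)
                                break;
                        write(a, buf, n);
                }
        }
}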

Or are you referring to the communication protocol, where the sending
side needs to wait every time until the receiving side requests a packet?

David Schwartz

Jan 4, 2010, 2:39:46 AM
On Jan 3, 5:50 am, Jef Driesen <jefdrie...@hotmail.com.invalid> wrote:

> What socat does is a simple select() loop, waiting until data arrives on
> one of the two pseudo terminals. When data is available on one side, it
> is read(), and then write() back to the other one (and vice versa of
> course). I would think the overhead is quite small.

> Or are you referring to the communication protocol, where the sending
> side needs to wait every time until the receiving side requests a packet?

You have to put it all together.

Sending side writes some data to the pty slave. It then has to go to
the pty master. Then socat must get a 'select' hit. Then socat does an
extra 'select' for some reason. Then socat does a read. Then socat
does a write to the other pty master. Then the data goes to the other
pty slave. Then the receiver wakes up. Then the receiver calls 'recv'.
And at this point, we're about 1/3 (!) done with the sending process
because there's a backflow on all of these and the sender passes a
flush down the line.

Then, when all that's finished, we're half done. The receiver has to
send an ACK, which also has to follow the line. The receiver also does
a flush.

Then, you'd think we'd be ready to send the next block, but you'd be
wrong. The sender does an extra flush after receiving the ACK in case
there was some line noise that might corrupt the beginning of the
block.

It's a worst case scenario all around. It just plain shouldn't be
done.

DS

Jef Driesen

Jan 4, 2010, 10:11:44 AM

I didn't think about the master<->slave flow.

What do you mean with 'backflow' and 'flush'?

I perfectly understand there is some performance penalty to pay for the
socat setup, but I expected it to be less. Especially because other
people seem to get much better performance. Factors of 10x and even
100x faster are a big difference.

David Schwartz

Jan 4, 2010, 10:19:08 AM
On Jan 4, 7:11 am, Jef Driesen <jefdrie...@hotmail.com.invalid> wrote:

> I didn't think about the master<->slave flow.

I think that's what's causing the disaster.

> What do you mean with 'backflow' and 'flush'?

When sx sends the data, the control flow has to get back to sx before
it can do anything else. That requires 'unthreading' your way through the
path you just followed to get the data from sx to rx.

> I perfectly understand there is some performance penalty to pay for the
> socat setup, but I expected it to be less. Especially because other
> people seem to get much better performance. Factors of 10x and even
> 100x faster are a big difference.

Here's a high-level view from my machine. The fields are:
1) Relative time operation complete. (In seconds.)
2) Relative time operation started. (In seconds.)
3) Program, SO=socat, SX=sx, RX=rx
4) System call, parameters, = return value

.094439 .094431 SX read(3, "15: DATA..."..., 128) = 128
.094494 .094474 SX write(1, "\1G\27015: DATA..."..., 132) = 132
.095243 .093526 SO select(6, [3 5], [], [], NULL) = 1 (in [3])
.095280 .095271 SO read(3, "\1G\27015: DATA..."..., 8192) = 132
.095324 .095314 SO write(5, "\1G\27015: DATA..."..., 132) = 132
.095372 .095362 SO select(6, [3 5], [5], [], NULL) = 1 (out [5])
.096358 .092610 RX read(0, "\1G\27015: DATA..."..., 8192) = 132
.096442 .096431 RX write(1, "\6", 1) = 1
.097352 .095410 SO select(6, [3 5], [], [], NULL) = 1 (in [5])
.097400 .097391 SO read(5, "\6", 8192) = 1
.097439 .097429 SO write(3, "\6", 1) = 1
.098364 .094614 SX read(0, "\6", 128) = 1

If you notice, the main delay seems to be between when the data is
written to the pty slave and when the process 'select'ing on the pty
master wakes up.
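
(Editorial note: that slave-to-master latency can be probed directly with
a small standalone program such as the sketch below. This is not code
from the thread; it assumes a Linux/glibc system.)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
        struct timeval t0, t1;
        char c = 'x', buf[16];
        fd_set rfds;
        long usec;

        int master = posix_openpt(O_RDWR | O_NOCTTY);
        if (master < 0 || grantpt(master) < 0 || unlockpt(master) < 0)
                return 1;
        int slave = open(ptsname(master), O_RDWR | O_NOCTTY);
        if (slave < 0)
                return 1;

        gettimeofday(&t0, NULL);
        write(slave, &c, 1);            /* byte enters the slave side */

        FD_ZERO(&rfds);
        FD_SET(master, &rfds);
        select(master + 1, &rfds, NULL, NULL, NULL);
        read(master, buf, sizeof buf);  /* byte arrives on the master */
        gettimeofday(&t1, NULL);

        usec = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("slave->master latency: %ld us\n", usec);
        return 0;
}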

DS

David Schwartz

Jan 4, 2010, 10:20:44 AM
On Jan 4, 7:19 am, David Schwartz <dav...@webmaster.com> wrote:

> If you notice, the main delay seems to be between when the data is
> written to the pty slave and when the process 'select'ing on the pty
> master wakes up.

Which, by the way, means you were probably right. This seems to be a
performance issue with the pty subsystem, possibly interacting with
the scheduler.

DS

David Schwartz

Jan 4, 2010, 11:09:54 AM
Aha!

/**
 * tty_schedule_flip - push characters to ldisc
 * @tty: tty to push from
 *
 * Takes any pending buffers and transfers their ownership to the
 * ldisc side of the queue. It then schedules those characters for
 * processing by the line discipline.
 *
 * Locking: Takes tty->buf.lock
 */
void tty_schedule_flip(struct tty_struct *tty)
{
        unsigned long flags;
        spin_lock_irqsave(&tty->buf.lock, flags);
        if (tty->buf.tail != NULL)
                tty->buf.tail->commit = tty->buf.tail->used;
        spin_unlock_irqrestore(&tty->buf.lock, flags);
        schedule_delayed_work(&tty->buf.work, 1);
}
EXPORT_SYMBOL(tty_schedule_flip);

This is the problem. This code specifically asks that the wakeup be
delayed one jiffy. Changing the "1" to a "0" should eliminate the
problem, though performance might be worse for "bulk data" cases.
(Consider a program that 'dribbles' bytes into the pty.)

My bet is the coder assumed that ttys would always be much slower than
the tick rate and so it made sense to accumulate characters rather
than scheduling a consumer more than once in a single tick.

DS

Jef Driesen

Jan 4, 2010, 5:40:29 PM

I think writing byte by byte is a bad idea anyway. Or will the
performance also suffer when larger amounts of data are written at once?
I suppose not.

I patched the ubuntu kernel, but I don't notice any change :-(
(I don't have much experience building kernels, but I think I did it right.)

David Schwartz

Jan 4, 2010, 9:31:57 PM
On Jan 4, 2:40 pm, Jef Driesen <jefdrie...@hotmail.com.invalid> wrote:

> I think writing byte by byte is a bad idea anyway. Or will the
> performance also suffer when larger amounts of data are written at once?
> I suppose not.

I don't think so. However, it's also possible that this code hides a
race condition. For example, what if the code looks like this:

1) Do some processing.
2) Schedule to wake the other end after a delay.
3) Do some more processing.

Here, removing the delay will cause the other process to sometimes run
before we do the processing in step 3. The delay may actually be
needed to ensure this doesn't happen.

> I patched the ubuntu kernel, but I don't notice any change :-(
> (I don't have much experience building kernels, but I think I did it right.)

Are you sure you installed and ran your new kernel? The 'uname -v'
command will tell you.

DS

Jef Driesen

Jan 5, 2010, 3:42:03 AM

Yes, I rebuilt the ubuntu kernel packages, following the instructions on
this page [1] (I used the apt-get source method). When installing the
resulting deb package, the previously installed Ubuntu package got
replaced (and I installed an older kernel too, in case something went
wrong). After rebooting, uname confirms I'm running the rebuilt kernel:

$ uname -v
#53 SMP Mon Jan 4 21:00:47 CET 2010

[1] https://help.ubuntu.com/community/Kernel/Compile

David Schwartz

Jan 5, 2010, 9:06:29 AM
On Jan 5, 12:42 am, Jef Driesen <jefdrie...@hotmail.com.invalid>
wrote:

> Yes, I rebuilt the ubuntu kernel packages, following the instructions on
> this page [1] (I used the apt-get source method). When installing the
> resulting deb package, the previously installed Ubuntu package got
> replaced (and I installed an older kernel too, in case something went
> wrong). After rebooting, uname confirms I'm running the rebuilt kernel:

*sigh* It may be that it rounds up, and zero and one have the same
effect. I think you can change:

    schedule_delayed_work(&tty->buf.work, 1);

to:

    flush_to_ldisc(&tty->buf.work.work);

and it will process it right then and there. But I'm not really sure
it's safe.
DS

Rainer Weikusat

Jan 5, 2010, 9:56:57 AM

Some discussion of this is available here

http://lkml.indiana.edu/hypermail/linux/kernel/0707.2/index.html#1566

as part of the 'Use tty_schedule in VT code' thread. The 2.6.32.2
kernel I am presently using apparently includes a 'tty low latency
mode' and the code there looks like this:

void tty_flip_buffer_push(struct tty_struct *tty)
{
        unsigned long flags;
        spin_lock_irqsave(&tty->buf.lock, flags);
        if (tty->buf.tail != NULL)
                tty->buf.tail->commit = tty->buf.tail->used;
        spin_unlock_irqrestore(&tty->buf.lock, flags);

        if (tty->low_latency)
                flush_to_ldisc(&tty->buf.work.work);
        else
                schedule_delayed_work(&tty->buf.work, 1);
}

And the comment above it warns that 'This function must not be called
from IRQ context if tty->low_latency is set' (tty_flip_buffer_push is
called from pty_write).

Jef Driesen

Jan 5, 2010, 4:40:15 PM

I tried this, but it doesn't seem to have any effect either. Now I start
to wonder if I'm doing something wrong with the patching, or is there
really no effect?

Rainer Weikusat

Jan 6, 2010, 5:29:34 AM

Are you sure that you are patching the subroutine which is actually
called? At least the 2.6.32 pty-code doesn't call tty_schedule_flip.

Jef Driesen

Jan 6, 2010, 11:41:12 AM

That might be the problem! I did patch the tty_schedule_flip() function.
My kernel is 2.6.31-16. Do you happen to know which function needs to be
patched? Maybe tty_flip_buffer_push()? It's identical, except that it
uses the low_latency flag to choose between schedule_delayed_work and
flush_to_ldisc.

See the attachments for the patches I tried. In case the newsgroup
doesn't allow attachments, they can be downloaded on my website too:

http://www.subaquaclub.be/users/jefdriesen/tmp/tty_v1.patch
http://www.subaquaclub.be/users/jefdriesen/tmp/tty_v2.patch

Rainer Weikusat

Jan 8, 2010, 2:19:08 PM
Jef Driesen <jefdr...@hotmail.com.invalid> writes:
> On 6/01/2010 11:29, Rainer Weikusat wrote:
>> Jef Driesen<jefdr...@hotmail.com.invalid> writes:
>>> On 05/01/10 15:06, David Schwartz wrote:

[ tty_schedule_flip ]

>>>> *sigh* It may be that it rounds up and zero and one have the same
>>>> effect. I think you can change:
>>>> schedule_delayed_work(&tty->buf.work, 1);
>>>> to:
>>>> flush_to_ldisc(&tty->buf.work.work);
>>>> And it will process it right then and there. But I'm not really sure
>>>> it's safe.
>>>
>>> I tried this, but it doesn't seem to have any effect either. Now I
>>> start to wonder if I'm doing something wrong with the patching, or is
>>> there really no effect?
>>
>> Are you sure that you are patching the subroutine which is actually
>> called? At least the 2.6.32 pty-code doesn't call tty_schedule_flip.
>
> That might be the problem! I did patch the tty_schedule_flip()
> function. My kernel is 2.6.31-16. Do you happen to know which function
> needs to be patched? Maybe tty_flip_buffer_push()? It's identical,
> except that it uses the low_latency flag to choose between
> schedule_delayed_work and flush_to_ldisc.

The pty_write routine ('hardware write routine' for ptys) calls
tty_flip_buffer_push (at least in 2.6.32). But before changing that,
I'd try to enable low latency mode for ptys (by setting
tty->low_latency to something != 0 in pty_open, for instance). Should
this help, I'd consider adding an ioctl (to pty_unix98_ioctl) to
enable/disable 'low latency mode' for ptys.
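
(Editorial sketch of the kind of change being suggested, assuming the
pty_open() in drivers/char/pty.c of that kernel generation; the
surrounding code is elided, not reproduced from the real source.)

/* drivers/char/pty.c -- sketch only */
static int pty_open(struct tty_struct *tty, struct file *filp)
{
        /* ... existing checks and initialisation ... */

        tty->low_latency = 1;   /* experiment: flush received data to the
                                   ldisc synchronously instead of via
                                   schedule_delayed_work() */

        /* ... */
        return 0;
}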

Jef Driesen

Jan 11, 2010, 8:12:51 AM

I patched the pty_open() function to set tty->low_latency=1. With this
change, transfers are significantly faster, but I also get a number of
timeout errors in the test:

$ socat PTY,link=/tmp/ttyS0,raw,echo=0 PTY,link=/tmp/ttyS1,raw,echo=0 &
$ sx input.bin >>/tmp/ttyS0 </tmp/ttyS0
Sending input.bin, 4206 blocks: Give your local XMODEM receive command now.
Xmodem sectors/kbytes sent: 462/57kRetry 0: NAK on sector
Xmodem sectors/kbytes sent: 545/68kRetry 0: NAK on sector
Xmodem sectors/kbytes sent: 956/119kRetry 0: NAK on sector
Xmodem sectors/kbytes sent: 2369/296kRetry 0: NAK on sector
Xmodem sectors/kbytes sent: 2657/332kRetry 0: NAK on sector
Bytes Sent: 538368 BPS:15634

Transfer complete

$ time rx output.bin >/tmp/ttyS1 </tmp/ttyS1

rx: ready to receive output.bin
Retry 0: TIMEOUT 206
Retry 0: TIMEOUT 335
Retry 0: TIMEOUT 188
Retry 0: TIMEOUT 655
Retry 0: TIMEOUT 975
Bytes received: 538368 BPS:17089

Transfer complete

real 0m31.509s
user 0m0.020s
sys 0m0.130s

The BPS varied between 9000 and 17000, but it is definitely an
improvement over the 3000 I got without the patch.

> Should
> this help, I'd consider adding an ioctl (to pty_unix98_ioctl) to
> enable/disable 'low latency mode' for ptys.

I suppose the idea is then to call this ioctl from my userspace code? So
the low latency mode is only enabled for my application?
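
(Editorial note: no such ioctl exists; purely as illustration, the
userspace side of that idea might look like the sketch below, with a
made-up request name and number.)

#include <sys/ioctl.h>

#define TIOCSLOWLAT 0x54F0      /* hypothetical request, invented here */

/* Ask the (hypothetical) pty driver to enable low latency mode on fd. */
static int set_pty_low_latency(int fd, int enable)
{
        return ioctl(fd, TIOCSLOWLAT, &enable);
}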

David Schwartz

Jan 11, 2010, 1:41:13 PM
On Jan 11, 5:12 am, Jef Driesen <jefdrie...@hotmail.com.invalid>
wrote:

> I patched the pty_open() function to set tty->low_latency=1. With this
> change, transfers are significantly faster, but I also get a number of
> timeout errors in the test:

My understanding (going from memory, and it was weak to begin with)
was that all TTYs associated with fast, modern hardware would have the
low_latency flag set. So it should be on for the whole pty system,
IMO. However, I vaguely remember there being bugs. :(

> $ socat PTY,link=/tmp/ttyS0,raw,echo=0 PTY,link=/tmp/ttyS1,raw,echo=0 &
> $ sx input.bin >>/tmp/ttyS0 </tmp/ttyS0
> Sending input.bin, 4206 blocks: Give your local XMODEM receive command now.
> Xmodem sectors/kbytes sent: 462/57kRetry 0: NAK on sector
> Xmodem sectors/kbytes sent: 545/68kRetry 0: NAK on sector
> Xmodem sectors/kbytes sent: 956/119kRetry 0: NAK on sector
> Xmodem sectors/kbytes sent: 2369/296kRetry 0: NAK on sector
> Xmodem sectors/kbytes sent: 2657/332kRetry 0: NAK on sector

> Retry 0: TIMEOUT 206
> Retry 0: TIMEOUT 335
> Retry 0: TIMEOUT 188
> Retry 0: TIMEOUT 655
> Retry 0: TIMEOUT 975
> Bytes received:  538368   BPS:17089

That's scary. On the bright side, you've found an easy to replicate
test case that demonstrates a bug in the pty layer, I think. :((

DS

Jef Driesen

Jan 12, 2010, 4:18:42 AM
On 11/01/2010 19:41, David Schwartz wrote:
> On Jan 11, 5:12 am, Jef Driesen<jefdrie...@hotmail.com.invalid>
> wrote:
>
>> I patched the pty_open() function to set tty->low_latency=1. With this
>> change, transfers are significantly faster, but I also get a number of
>> timeout errors in the test:
>
> My understanding (going from memory, and it was weak to begin with)
> was that all TTYs associated with fast, modern hardware would have the
> low_latency flag set. So it should be on for the whole pty system,
> IMO. However, I vaguely remember there being bugs. :(

I think my Core 2 Duo still counts as modern hardware, but judging from
the speed improvement, the low_latency flag is not set by default.

>> $ socat PTY,link=/tmp/ttyS0,raw,echo=0 PTY,link=/tmp/ttyS1,raw,echo=0 &
>> $ sx input.bin >>/tmp/ttyS0 </tmp/ttyS0
>> Sending input.bin, 4206 blocks: Give your local XMODEM receive command now.
>> Xmodem sectors/kbytes sent: 462/57kRetry 0: NAK on sector
>> Xmodem sectors/kbytes sent: 545/68kRetry 0: NAK on sector
>> Xmodem sectors/kbytes sent: 956/119kRetry 0: NAK on sector
>> Xmodem sectors/kbytes sent: 2369/296kRetry 0: NAK on sector
>> Xmodem sectors/kbytes sent: 2657/332kRetry 0: NAK on sector
>> Retry 0: TIMEOUT 206
>> Retry 0: TIMEOUT 335
>> Retry 0: TIMEOUT 188
>> Retry 0: TIMEOUT 655
>> Retry 0: TIMEOUT 975
>> Bytes received: 538368 BPS:17089
>
> That's scary. On the bright side, you've found an easy to replicate
> test case that demonstrates a bug in the pty layer, I think. :((

I think this might be related to this commit:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e043e42bdb66885b3ac10d27a01ccb9972e2b0a3

The patch that was reverted in that commit is almost equal to the patch
I applied!

Rainer Weikusat

Jan 15, 2010, 12:55:10 PM
Jef Driesen <jefdr...@hotmail.com.invalid> writes:
> On 11/01/2010 19:41, David Schwartz wrote:
>> On Jan 11, 5:12 am, Jef Driesen<jefdrie...@hotmail.com.invalid>

[...]

>>> $ socat PTY,link=/tmp/ttyS0,raw,echo=0 PTY,link=/tmp/ttyS1,raw,echo=0 &
>>> $ sx input.bin >>/tmp/ttyS0 </tmp/ttyS0
>>> Sending input.bin, 4206 blocks: Give your local XMODEM receive command now.
>>> Xmodem sectors/kbytes sent: 462/57kRetry 0: NAK on sector
>>> Xmodem sectors/kbytes sent: 545/68kRetry 0: NAK on sector
>>> Xmodem sectors/kbytes sent: 956/119kRetry 0: NAK on sector
>>> Xmodem sectors/kbytes sent: 2369/296kRetry 0: NAK on sector
>>> Xmodem sectors/kbytes sent: 2657/332kRetry 0: NAK on sector
>>> Retry 0: TIMEOUT 206
>>> Retry 0: TIMEOUT 335
>>> Retry 0: TIMEOUT 188
>>> Retry 0: TIMEOUT 655
>>> Retry 0: TIMEOUT 975
>>> Bytes received: 538368 BPS:17089
>>
>> That's scary. On the bright side, you've found an easy to replicate
>> test case that demonstrates a bug in the pty layer, I think. :((
>
> I think this might be related to this commit:
>
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=e043e42bdb66885b3ac10d27a01ccb9972e2b0a3
>
> The patch that was reverted in that commit is almost equal to the
> patch I applied!

Not really. The patch was reverted because pty_write is reportedly
called from IRQ context and the flush_to_ldisc code reportedly
acquires 'process context' locks (which can cause the caller to
sleep), which is not allowed in IRQ context. Also, that modification
was done in tty_close, to flush data upon close, while the
experimental modification I suggested was intended to cause 'flushing
of data' during write. What might also help with your problem, though,
is this:

index ff47907..973be2f 100644
--- a/drivers/char/n_tty.c
+++ b/drivers/char/n_tty.c
@@ -1583,6 +1583,7 @@ static int n_tty_open(struct tty_struct *tty)
 
 static inline int input_available_p(struct tty_struct *tty, int amt)
 {
+        tty_flush_to_ldisc(tty);
         if (tty->icanon) {
                 if (tty->canon_data)
                         return 1;

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blobdiff;f=drivers/char/n_tty.c;h=973be2f441951ed0e68d658c1192c94524f33aff;hp=ff47907ff1bf9f9ae8ddc3e073c06ae4f3a615df;hb=e043e42bdb66885b3ac10d27a01ccb9972e2b0a3;hpb=7d3e91b8a1f5179d56a7412d4b499f2d5fc6b25d

This is the inverse operation, so to speak: as opposed to pushing the
data on write, a reading process tries to pull data in before it
blocks in the kernel waiting for input to arrive.
