Expect, telnet, frozen console.

Gavin McDonald

unread,

Mar 29, 2004, 2:47:53 PM3/29/04

to

Hi all.

I have Expect v5.38.0/tcl v8.4 on linux 2.4.20 - Slackware 9.0.

I have a simple task I wish to automate:
telnet into remote host, provide username/password, surrender control to
user. This script will eventually be tied into /etc/inittab so that it
can be run in multiple consoles. (Another question in its own time.) The
script is called with a character and a tty as arguments. The script,
quoted below, works but hangs in a specific situation;

#!/usr/bin/expect --
log_user 0
set stty_init "sane cs8 -istrip"
spawn -nottycopy -nottyinit telnet $HOST
set tty_spawn_id /dev/[lrange $argv 1 1]
expect login
send "$USER[lrange $argv 0 0]\r"
expect Password
send "$PASSWORD\r"
log_user 1
interact eof

For each $USER on $HOST there are [a..j] logins, named [$USERa..$USERj].
the command:
# autologin b tty2
logs in to $HOST as $USERb. the .profile for said user starts a custom
app called 'loe' which presents a login screen in 80x24 ASCII a la
ncurses. (NOT ncurses tho...) from this point, I can login, use the app,
and all the f-keys, CTRL sequences etc work fine. To back out of a
screen, the user presses ESC, which also works fine. However, when the
user presses ESC to back out of the main menu, to the login screen, All
I/O stops. The cursor disappears, and no further input is accepted. The
processes stay active however, and if I 'kill -HUP' the telnet process,
expect exits cleanly. If I telnet manually to the host, this problem
does not happen at all.

I have compared the output of 'env' for the telnet, and the expect
logins, and they are identical. It is as though the "login" screen that
is displayed sends a control character which takes expect out of the
"interact" command.

I have, on occasion, been able to avoid this by pressing ESC twice, (or
ESC,ENTER) with enough speed that they transmit before the screen
redraws. I have also tried removing "eof" from interact, but still the
problem persists.

The "set stty_init ..." line is an attempt to solve the problem as was
done here: http://tinyurl.com/ytpse which met with limited, (though
unrepeatable,) success. The line "set tty_spawn_id" was yet another
attempt to fix the problem, again unsuccessfully. (Now I leave it as an
artifact for adding the aforementioned virtual terminal support.)

I now turn to the sages at c.l.t in hopes that someone has seen this
before, or knows where to turn.

One last thing: I have been in contact with tier 1 support for the
'loe' people, and they were unable/unwilling to furnish further details
on the output of their login screen.

Thanks for your help,
Gavin McDonald.

Gavin McDonald

unread,

Mar 30, 2004, 3:01:41 PM3/30/04

to

Hi again,
further to this problem, I have captured the output of strace
both in an instance where the script hangs, and an instance
where it does not. I quote the relevant portions below:

//////////////////BROKEN//////////////////////////

--** ESC PRESSED **--
read(0, "\33", 4096) = 1
write(4, "\33", 1) = 1
select(5, [0 4], [], [0 4], NULL) = 1 (in [4])
read(4, "\33[?25l\33[5;38f "..., 4096) = 54
write(1, "\33[?25l\33[5;38f "..., 54) = 54
select(5, [0 4], [], [0 4], NULL) = 1 (in [4])
read(4, "\33[36;40m\33[36;40m\33[;10;1m\r\33[2J\33[1"..., 4096) = 4016
brk(0x806c000) = 0x806c000
write(1, "\33[36;40m\33[36;40m\33[;10;1m\r\33[2J\33[1"..., 4001) = 4001
gettimeofday({1080673605, 582028}, {480, 0}) = 0
gettimeofday({1080673605, 582229}, {480, 0}) = 0
gettimeofday({1080673605, 582445}, {480, 0}) = 0
select(5, [0], [], [0 4], {0, 0}) = 0 (Timeout)
gettimeofday({1080673605, 583390}, {480, 0}) = 0
gettimeofday({1080673605, 583592}, {480, 0}) = 0
gettimeofday({1080673605, 583786}, {480, 0}) = 0
gettimeofday({1080673605, 583971}, {480, 0}) = 0
gettimeofday({1080673605, 584215}, {480, 0}) = 0
read(4, 0x8065530, 4096) = -1 EIO (Input/output error)
--- SIGCHLD (Child exited) ---
write(1, ".\33[13;48f\33[?25h\0\33[;", 19) = 19
close(4) = 0

//////////////////WORKING//////////////////////////

--** ESC PRESSED **--
read(0, "\33", 4096) = 1
write(4, "\33", 1) = 1
select(5, [0 4], [], [0 4], NULL) = 1 (in [0])
--** ENTER PRESSED **--
read(0, "\r", 4096) = 1
write(4, "\r", 1) = 1
select(5, [0 4], [], [0 4], NULL) = 1 (in [4])
read(4, "\33[?25l\33[5;38f "..., 4096) = 54
write(1, "\33[?25l\33[5;38f "..., 54) = 54
select(5, [0 4], [], [0 4], NULL) = 1 (in [4])
read(4, "\33[36;40m\33[36;40m\33[;10;1m\r\33[2J\33[1"..., 4096) = 4095
brk(0x806c000) = 0x806c000
write(1, "\33[36;40m\33[36;40m\33[;10;1m\r\33[2J\33[1"..., 4001) = 4001
gettimeofday({1080674153, 231103}, {480, 0}) = 0
gettimeofday({1080674153, 231332}, {480, 0}) = 0
gettimeofday({1080674153, 231575}, {480, 0}) = 0
select(5, [0], [], [0 4], {0, 0}) = 0 (Timeout)
gettimeofday({1080674153, 232597}, {480, 0}) = 0
gettimeofday({1080674153, 232824}, {480, 0}) = 0
gettimeofday({1080674153, 233047}, {480, 0}) = 0
gettimeofday({1080674153, 233262}, {480, 0}) = 0
gettimeofday({1080674153, 233532}, {480, 0}) = 0
read(4, ".\33[15;48f\33[?25h", 4096) = 15
write(1, ".\33[13;48f\33[?25h\33[?25l\33[13;48f "..., 109) = 109
gettimeofday({1080674153, 234679}, {480, 0}) = 0
select(5, [0 4], [], [0 4], {0, 0}) = 0 (Timeout)
gettimeofday({1080674153, 235702}, {480, 0}) = 0
gettimeofday({1080674153, 235926}, {480, 0}) = 0
select(5, [0 4], [], [0 4], NULL) = 1 (in [0])
--** SCREEN ACCEPTS INPUT **--
read(0, "B", 4096) = 1

//////////////////END STRACES//////////////////////////

perhaps this will make the problem clearer to someone who
can help. Un fortunately, This is my first foray into
expect, and all the faqs seem to point me here.

tia,
Gavin.

In article <MPG.1ad21d456...@shawnews.vc.shawcable.net>,
ga...@advanceforklift.com says...

Don Libes

unread,

Mar 30, 2004, 5:46:01 PM3/30/04

to

From looking at your strace of the broken case, it appears that the
connection to the telnet client is being closed from the remote side.

I can't explain why this doesn't happen when running manually. I'm
suspicious of the tty parameters because of the ENTER key difference
in your working script. That certainly seems peculiar. But that
might just be a red herring.

At this point, I would apply strace to the telnet client itself. I
wouldn't be at all surprised if it's a bug in telnet. Since you're
using Linux, it would be nice if you could run telnet (recompile for
debugging if necessary) under the debugger and watch it so you can see
why it is prematurely closing the connection.

One other curious thing. Interact is noticing that telnet shut down
because we can see the "close(4)". However, it sounds like you are
saying interact never returns. Is that right? Your strace ends right
after the close. I assume there is more to the strace, yes? If not,
I would point the C debugger at Expect's close() and then just single
step from there to find out why interact doesn't notice that and
return.

Don

Gavin McDonald

unread,

Mar 30, 2004, 7:21:35 PM3/30/04

to

(oops, originally only replied to Don...)

I beleive the

--- SIGCHLD (Child exited) ---

message is from the
# kill -HUP
i issued to hangup the telnet app. Without it, the console stays
frozen, and the telnet session does not disconnect. Therefore
pointing a debugger at the close() would be ineffective. Also, when I
run telnet outside of expect, this problem does not occur at all.

The only other difference I have noticed is between

> write(1, ".\33[13;48f\33[?25h\0\33[;", 19) = 19

and

> write(1, ".\33[13;48f\33[?25h\33[?25l\33[13;48f "..., 109) = 109

where there seems to be a NULL character preceeding the escape char.
I have recently tried "interact { null send "BYE\r" }" in hopes of
catching this, but to no avail.

Also, I have compared the output of env when telnetting direct, vs
telnetting through the expect script, and the results are identical,
so while I would like to find fault in the tty settings, there seems
to be none.

If you wish, I can send you the complete strace output from both runs,
in a .tgz, I simply edited here for brevity/clarity.

Thank you,
Gavin McDonald.

In article <s6aoeqe...@peace.mel.nist.gov>, li...@nist.gov says...

Gavin McDonald

unread,

Mar 30, 2004, 7:36:05 PM3/30/04

to

another thing I would like to add, unsure if this matters;

I just did an strace with the output piped to another tty. The
console stalls at:

gettimeofday({1080673605, 584215}, {480, 0}) = 0
read(4,

and when I kill -HUP the telnet process, it continues with

0x8065530, 4096) = -1 EIO (Input/output error)
--- SIGCHLD (Child exited) ---

So expect is hanging while waiting for input from the telnet
client...? I know (from earlier problems) that the remote
application is using "custom" terminal drivers, so it may
be that expect is not expecting ;) the data it gets from the
telnet session... Maybe the input handler in expect is choking
on telnet's output?

-G

In article <s6aoeqe...@peace.mel.nist.gov>, li...@nist.gov says...

Thomas Dickey

unread,

Mar 31, 2004, 11:42:45 AM3/31/04

to

Gavin McDonald <ga...@advanceforklift.com> wrote:
> Hi again,
> further to this problem, I have captured the output of strace
> both in an instance where the script hangs, and an instance
> where it does not. I quote the relevant portions below:

> //////////////////BROKEN//////////////////////////

> --** ESC PRESSED **--
> read(0, "\33", 4096) = 1
> write(4, "\33", 1) = 1
> select(5, [0 4], [], [0 4], NULL) = 1 (in [4])
> read(4, "\33[?25l\33[5;38f "..., 4096) = 54
> write(1, "\33[?25l\33[5;38f "..., 54) = 54

well, it is demonstrably hardcoded for Linux console. The \33[5;38f"
is hvc (an unnecessary variant of cup which isn't used by terminfo).

Perhaps they're also using the broken so-called OSC for setting colors.
There's not enough detail in this trace to see, but essentially it's
a string that is sent without a terminator. That would be a string
beginning "\33]".

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

Gavin McDonald

unread,

Mar 31, 2004, 2:23:00 PM3/31/04

to

> is hvc (an unnecessary variant of cup which isn't used by terminfo).

Can you provide a reference? I'd like to learn more about this hvc, but
google comes back empty handed...
http://www.google.com/search?q=hvc%20cup%20terminfo - 1 (unrelated)item.

> Perhaps they're also using the broken so-called OSC for setting colors.

> beginning "\33]".
I have forwarded this question to the developers of the custom app, to
find out if they do indeed use OSC sequences. If this is the case, is
telnet then to blame? Remember that running telnet from the command
line does not reproduce this errant behaviour. Perhaps expect is
looking too far into the data it relays? (rather than blindly copying it
to another file descriptor?) I may yet have to grab the sources for
expect and start looking there...

thank-you,
-G

In article <106lt85...@corp.supernews.com>,
dic...@saltmine.radix.net says...

Thomas Dickey

unread,

Mar 31, 2004, 2:56:59 PM3/31/04

to

Gavin McDonald <ga...@advanceforklift.com> wrote:

>> is hvc (an unnecessary variant of cup which isn't used by terminfo).
> Can you provide a reference? I'd like to learn more about this hvc, but
> google comes back empty handed...
> http://www.google.com/search?q=hvc%20cup%20terminfo - 1 (unrelated)item.

I noticed that the 'f' was unusual, looked in ncurses' terminfo.src and
didn't see anything likely there (you're probably not using a tvi920).
So I (wasn't on linux at the moment) looked with google for
"linux console_codes" to find a copy of the manpage. It lists
the ECMA-48 (ISO-6429) mnemonics for the control codes. ECMA-48's
probably online (I recall seeing a pdf for it some time ago), but
noting the use for it (equivalent to cup), I recalled this as one of
those features that wasn't really needed.

>> Perhaps they're also using the broken so-called OSC for setting colors.
>> beginning "\33]".

That was my thought - if they're hardcoding strings for hvc into the
application, they're probably just reading Linux' console_codes and
using whatever looks nice.

> I have forwarded this question to the developers of the custom app, to
> find out if they do indeed use OSC sequences. If this is the case, is
> telnet then to blame? Remember that running telnet from the command

telnet's innocent. But what terminal (emulator) is it running in?
I noticed this problem with Linux console last year with xterm, and
added a workaround (a resource "brokenLinuxOSC"). You might for example
be using an older xterm, from XFree86 4.3.0 - or some other application
that copied the check.

> thank-you,
> -G

--

Gavin McDonald

unread,

Mar 31, 2004, 4:30:02 PM3/31/04

to

> telnet's innocent. But what terminal (emulator) is it running in?
> I noticed this problem with Linux console last year with xterm, and
> added a workaround (a resource "brokenLinuxOSC"). You might for example

Well, this is where it gets sticky; This isn't running in a terminal
emulator. I'm in text-mode, using all 12 virtual ttys. Didn't even
install XFree86 on the box. Since you did ask though, I have another
box running KDE, and so I fired up a 'konsole' session, ssh'd from there
into the host where I am developing the expect script, then ran said
script. Same problem. So I switched to the "emergency" console on
box B, ssh'd from _there_ ran the script, and same problem. Finally
on a whim, I tried xterm->ssh->expect->telnet->app and it hung too.

This would seem to move the emulator out of the equation... The only
pattern I haven't tried would be moving the _script_ to box B, and
running it local there. However, since that involves some excess leg-
work, I'll only do it if deemed necessary. (Since, after all, I expect
it will probably fail there too.)

-G

In article <106m8kb...@corp.supernews.com>,

Thomas Dickey

unread,

Mar 31, 2004, 5:02:35 PM3/31/04

to

Gavin McDonald <ga...@advanceforklift.com> wrote:

> This would seem to move the emulator out of the equation... The only

no problem (it was worth mentioning since the topic's come up 2-3 times,
and the original description didn't quite tell me whether it was relevant).

Don Libes

unread,

Apr 1, 2004, 11:12:28 AM4/1/04

to

You're right. That NULL is strange although for a different reason.
It shouldn't be there at all because Expect strips nulls by default.
I assumed that was the reason for the discrepancy in these two lines:

read(4, "\33[36;40m\33[36;40m\33[;10;1m\r\33[2J\33[1"..., 4096) = 4016

write(1, "\33[36;40m\33[36;40m\33[;10;1m\r\33[2J\33[1"..., 4001) = 4001

Presumably, Expect is reading a long string with 15 nulls interspersed
at various spots. These may or not be significant. (Some physical
terminals need to slow down output while others use them for real
data.) If it's a Linux terminal emulator, it ought to be for real
data in which case you should add "remove_nulls 0". Why don't you try
that.

Elsewhere you mentioned that you're not using a terminal emulator. In
that case, aren't you seeing a lot of gobbledegook in the final
output? Anyway, try turning off null removal.

One other thing that looks odd is the following:

select(5, [0], [], [0 4], {0, 0}) = 0 (Timeout)

That shows Tcl timing out in a select call (the {0, 0} means "timeout
immediately if no events are pending"). But with interact, Expect
doesn't ask Tcl for a timeout so I don't understand why Tcl would be
testing for one. What you should see is something like what is seen
earlier:

select(5, [0 4], [], [0 4], NULL) = 1 (in [4])

The NULL timeval means "wait forever". So it appears that Tcl is
doing some odd. It might be worthwhile to watch the interact code as
it establishes itself with an infinite timeout and calls Tcl_DoOneEvent.

But before you do that, I recommend you upgrade! In fact, Expect's
HISTORY notes a interact/timeout related issue. I don't recall the
details but we're grasping at so many straws here that it's worth a
shot.

5/27/03 5.38.1
Dimitar Haralanov <mitko@tahoenetworks...> noted that
interact dumped core with interact { timeout 1 }

Gavin McDonald

unread,

Apr 14, 2004, 12:57:19 PM4/14/04

to

It was once posted outside the office of a past Instructor:
"Beware of programmers carrying screwdrivers, Especially Gavin."

And so the saga of frozen expect continues; In exasperation, I
cloned a working system on top of the broken system, via ghost.
After changing /etc/HOSTNAME, and the /etc/rc.d/rc.netdevice,
I tested the script, and it STILL FROZE in expect!

Since I had two machines disemboweled at my desk, I did the next
illogical thing: I swapped hard drives. Putting the cloned HD in
the box that worked, STILL WORKS! but booting in the box that
never worked still freezes.

I have since upgraded the CPU in the "bad" machine, still freezes.
The only original "bad" parts in this box are Motherboard/RAM/NIC/VIDEO.
I will now begin the gruelling task of isolating the hardware fault.

Just wanted to let you know that it seems like a hardware fault,
and to ask if anyone has ever heard of this sort of thing before?
it seems awfully weird that faulty hardware could hang a software
process, but not the whole OS.. (I would expect a hardware fault
to inevitably manifest itself in a kernel process, then *bam*. Not
so here...)

-G

In article <106m8kb...@corp.supernews.com>,

Andreas Leitgeb

unread,

Apr 14, 2004, 1:26:20 PM4/14/04

to

Gavin McDonald <ga...@advanceforklift.com> wrote:
> Just wanted to let you know that it seems like a hardware fault,
> and to ask if anyone has ever heard of this sort of thing before?
> it seems awfully weird that faulty hardware could hang a software
> process, but not the whole OS.. (I would expect a hardware fault
> to inevitably manifest itself in a kernel process, then *bam*. Not
> so here...)

I haven't followed the entire thread, but another thing that comes
to my mind that is still constant within a machine, even after
mobo, processor, ram, harddisk, PCI-cards, power unit, ... have been
swapped, is it's IP-address (well likely, that is).

Are you sure, that there isn't some nasty router inbetween that
decides that your machine is not to receive some certain packets ...

Have you tried to let your two machines swap their IP-addresses,
and/or even NICs (in case you use a dhcp-server that is beyond your
control) ?

Mike Ellenberger

unread,

Apr 16, 2004, 7:52:55 PM4/16/04

to

Gavin McDonald <ga...@advanceforklift.com> wrote in message news:<MPG.1ae70d48e...@shawnews.vc.shawcable.net>...

> It was once posted outside the office of a past Instructor:
> "Beware of programmers carrying screwdrivers, Especially Gavin."
>
> And so the saga of frozen expect continues; In exasperation, I
> cloned a working system on top of the broken system, via ghost.
> After changing /etc/HOSTNAME, and the /etc/rc.d/rc.netdevice,
> I tested the script, and it STILL FROZE in expect!
>
> Since I had two machines disemboweled at my desk, I did the next
> illogical thing: I swapped hard drives. Putting the cloned HD in
> the box that worked, STILL WORKS! but booting in the box that
> never worked still freezes.
>
> I have since upgraded the CPU in the "bad" machine, still freezes.
> The only original "bad" parts in this box are Motherboard/RAM/NIC/VIDEO.
> I will now begin the gruelling task of isolating the hardware fault.
>
> Just wanted to let you know that it seems like a hardware fault,
> and to ask if anyone has ever heard of this sort of thing before?
> it seems awfully weird that faulty hardware could hang a software
> process, but not the whole OS.. (I would expect a hardware fault
> to inevitably manifest itself in a kernel process, then *bam*. Not
> so here...)
>
>
> -G

I'm seeing a similar "hang" using Tcl8.4.5 and Expect5.39. We spawn a
shell on a remote machine, feed it a command (i.e., ypcat hosts), wait
for data. With our old version of Tcl7.5 and Expect 5.21 there is no
problem. Incomplete buffers are always completed and our patterns
match. With Expect 5.36 - 5.40 incomplete buffers are filled most of
the time so processing continues but before we can get all the data
expect hangs waiting for data that is never delivered. The
exp_internal -f xxx 0 trace shows partial data and there it sits.

At this point I don't know if it's Expect or Tcl.

Mike

Gavin McDonald

unread,

Apr 20, 2004, 12:06:26 PM4/20/04

to

> Are you sure, that there isn't some nasty router inbetween that

> Have you tried to let your two machines swap their IP-addresses,

Not the IP addresses, and there are no routers, only Switches...

In fact, I finally found a solution to this:

I switched network cards!

I have two RTL8139-based cards, neither of which worked. I switched
to an ne2k-pci card, and the problems went away. I now think that
there may be a bug in the nic driver (8139too). Or barring that, (as
I can't reproduce this error outside of the console,) the change in
memory footprint of the NIC driver is enough to move the problem to
another app... I'll find out later, I'm sure. ;)

If anyone wants to chase this further, I will gladly offer any help I
can, but as it stands, I have 'fixed' the problem. (albeit incorrectly.)

Thanks everyone for your help with this!
Gavin McDonald.

In article <slrnc7qt1...@gamma.logic.tuwien.ac.at>,
a...@gamma.logic.tuwien.ac.at says...

Andreas Leitgeb

unread,

Apr 21, 2004, 3:35:53 AM4/21/04

to

Gavin McDonald <ga...@advanceforklift.com> wrote:
> I have two RTL8139-based cards, neither of which worked. I switched
> to an ne2k-pci card, and the problems went away. I now think that
> there may be a bug in the nic driver (8139too).

As a final note, I want to add, that I'm using one rtl8139-card,
with the same 8139too driver at home (with various kernels from
2.4.18 till 2.6.5), and never had any trouble with networking
(at full 100Mbs). But maybe just because I haven't had any task
yet that I would have needed expect for ;-/

If you have some tendency to still pursue this, and if you also
have some sample code that exhibited your bug on your machine that
you can also share, I might try it. (If you posted one before,
just let me know, so I'll look for it on google)
If not, I can understand it :-)