I've recently built and installed these versions on solaris. This is
an upgrade from Tcl7.5 and Expect 5.21.
We have scripts which spawn a shell on another system, feed it a
command, and the get the data as it comes back. What I have seen with
the new version are hangs where expect is waiting for data matching a
pattern. At first I thought this was a bug in our scripts but after
reading how expect is supposed to operate it is suppose to be smart
enough to keep getting data till a pattern can be matched.
I actually added code to handle the incomplete buffer and process it.
Sometimes this works and other times it fails on the last buffer with
a signal in Tcl_Eval.
I can use 'cat' to return the same amount of data without hanging so I
didn't think it was a problem with the amount of data but I could be
wrong.
Are these know problems with either Tcl or Expect at these levels?
Are there fixes? I tried Expect 5.40 and 5.38 with similar results.
Other suggestions?
Thanks,
Mike
I do not have any help but I have seen similar behaviour with 8.4.5 and
Expect 5.39 when run on Redhat Linux 8.0. A script that works fine with
8.3.5 and Expect 5.38 hangs when I upgrade TCL and Expect. But on my
development machine everything works fine with either TCL 8.3.5 or 8.4.5 and
Expect 5.38 or 5.39. The differance here is that my development machine is
Redhat 7.3 versus 8.0.
A signal in Tcl_Eval? Gosh. Something sounds dreadfully wrong on
your system. Can you be a little more specific?
Don
Thanks for the response Don. Some corrections to my previous post.
Contrary to what I first thought, using cat to return data can create
the hang as well. I also tried an earlier version of Expect and Tcl
to see if that would help. It didn't. Something must have changed in
this area after Tcl7.5 and Expect 5.21.
I believe the signal was caused by me hitting Cntl-c to terminate the
hanging process.
I have increased match_max and 'match_max -d' to 20480 and have been
testing with a portion of the script this afternoon. This seems to be
working but I suspect if we suddenly try to retrieve more data this
will probably break the script.
Our script looks like this, I added eof and full_buffer to see if
either would get executed.
while {1} {
expect {
-i $ID -re "(\[^\r]*)(\r\r\n|\r\n)" {
append lines "$expect_out(1,string)\n"
continue
}
"$user_prompt" {
puts "prompt"
break
}
eof {
puts "\n >>> hit eof"
}
full_buffer {
puts "\n full buffer hit"
exit -1
}
}
What I have seen in the exp_internal -f trace is only part of a line,
this was run with match_max at 2000 (default)
expect: does "xxx.xx.xx.xxx\tultraclean\r\nxxx.xx.xxx.xx\trad"
(spawn_id exp4) m
atch regular expression "([^\r]*)(\r\r\n|\r\n)"? yes
expect: set expect_out(0,string) "xxx.xx.xx.xx\tultraclean\r\n"
expect: set expect_out(1,string) "xxx.xx.xx.xx\tultraclean"
expect: set expect_out(2,string) "\r\n"
expect: set expect_out(spawn_id) "exp4"
expect: set expect_out(buffer) "xxx.xx.xx.xxx\tultraclean\r\n"
expect: continuing expect
expect: does "xxx.xx.xx.xx\trad" (spawn_id exp4) match regular
expression "([^\
r]*)(\r\r\n|\r\n)"? no
"%%%"? no
sighandler: handling signal(2)
async event handler: Tcl_Eval(exit 130)
As far as I know I am not hitting full_buffer. Now this is toward the
end of the data I am retrieving (100 entries left, 6079 lines already
retrieved).
My understanding is that expect should continue to ask for data and
fill the buffer until our pattern matches. The old version of Tcl and
Expect work with our script.
I don't know if this is a bug or whether we need to add logic to our
script to handle this condition.
Thanks,
Mike
Well, that's only about 8 years worth of differences. ;)
> Our script looks like this, I added eof and full_buffer to see if
> either would get executed.
I would recommend trying something like sockspy to see exactly
what should be coming through (that expect itself might not even
be reporting).
> while {1} {
> expect {
...
Have you considered exp_continue instead of a while loop?
> My understanding is that expect should continue to ask for data and
> fill the buffer until our pattern matches. The old version of Tcl and
> Expect work with our script.
>
> I don't know if this is a bug or whether we need to add logic to our
> script to handle this condition.
I believe that this may well be a core bug new in Tcl 8.4. I've
been hearing of sporadic issues relating to IO handling getting
caught up, although almost all cases had the issue in combination
with starkits.
--
Jeff Hobbs, The Tcl Guy
http://www.ActiveState.com/, a division of Sophos
It has been a while since an upgrade. I'm new to this project...
Given time anything can be broken...
>
> > Our script looks like this, I added eof and full_buffer to see if
> > either would get executed.
>
> I would recommend trying something like sockspy to see exactly
> what should be coming through (that expect itself might not even
> be reporting).
>
> > while {1} {
> > expect {
> ...
I haven't heard of this. It's a socket trace? Where do I get it?
>
> Have you considered exp_continue instead of a while loop?
I will give this a try. If I understand, I'd used exp_continue in the
case where I expect more data and remove the while logic?
>
> > My understanding is that expect should continue to ask for data and
> > fill the buffer until our pattern matches. The old version of Tcl and
> > Expect work with our script.
> >
> > I don't know if this is a bug or whether we need to add logic to our
> > script to handle this condition.
>
> I believe that this may well be a core bug new in Tcl 8.4. I've
> been hearing of sporadic issues relating to IO handling getting
> caught up, although almost all cases had the issue in combination
> with starkits.
I tried this with Tcl8.2.1 Expect5.32 and still got the hang.
Thanks,
Mike
More information on this problem.
Here is a backtrace of the hang. It is consistently at _read().
Comments in ExpInputProc seem to indicate that the read should
complete as soon as data is available. The value of toRead seems to
change from frame 4, frame 3, and finally at frame 1. I haven't
looked at this to see why.
#0 0xff21da28 in _read () from /usr/lib/libc.so.1
#1 0x000284c8 in ExpInputProc (instanceData=0x176078,
buf=0x1784f0 "rver\r\n137.69.53.234\tpv2-mv\r\n137.69.51.200\tpacket\r\n137.69.3.186\tmarvel\r\n137.69.12.162\tcal162\r\n137.69.9.235\tbutiki\r\n137.69.5.45\tableix\r\n137.69.121.160\tbelem\r\n137.69.101.155\tubik\r\n137.69.103.118\trunt\r\n13"...,
toRead=4096, errorCodePtr=0xffbfe260) at exp_chan.c:122
#2 0x000b1148 in GetInput (chanPtr=0x173a48) at
./../generic/tclIO.c:5303
#3 0x000afb98 in DoReadChars (chanPtr=0x173a48, objPtr=0x164b88,
toRead=2092,
appendFlag=1) at ./../generic/tclIO.c:4447
#4 0x000af92c in Tcl_ReadChars (chan=0x173a48, objPtr=0x164b88,
toRead=4000,
appendFlag=1) at ./../generic/tclIO.c:4339
#5 0x000363d8 in expIRead (interp=0x15ba80, esPtr=0x176078,
timeout=10,
save_flags=0) at expect.c:1652
#6 0x000364b0 in expRead (interp=0x15ba80, esPtrs=0x172f80,
esPtrsMax=1,
esPtrOut=0xffbfe580, timeout=10, key=12317) at expect.c:1723
#7 0x000383d4 in Exp_ExpectObjCmd (clientData=0x0, interp=0x15ba80,
objc=13,
objv=0xffbfe9a8) at expect.c:2526
#8 0x00046d18 in TclEvalObjvInternal (interp=0x15ba80, objc=13,
objv=0xffbfe9a8, command=0x12d3a8 "", length=0, flags=0)
at ./../generic/tclBasic.c:3087
#9 0x00047058 in Tcl_EvalObjv (interp=0x15ba80, objc=13,
objv=0xffbfe9a8,
flags=0) at ./../generic/tclBasic.c:3203
#10 0x000332a4 in exp_eval_with_one_arg (clientData=0x0,
interp=0x15ba80,
(gdb) p *esPtr
$7 = {channel = 0x173a48, name = "exp4", '\0' <repeats 36 times>, fdin
= 4,
fdout = 4, channel_orig = 0x0, fd_slave = -1, validMask = 14, pid =
29846,
buffer = 0x164b88, msize = 4001, umsize = 2000, printed = 4, echoed
= 0,
rm_nulls = 1, open = 1, user_waited = 0, sys_waited = 0, registered
= 1,
wait = 0, parity = 1, key = 12317, force_read = 1, notified = 0,
notifiedMask = 2, fg_armed = 0, leaveopen = 0, bg_interp = 0x0,
bg_ecount = 0, bg_status = unarmed, freeWhenBgHandlerUnblocked = 0,
fdBusy = 0, keepForever = 0, valid = 1, nextPtr = 0x170b50}
(gdb) p toRead
$8 = 4096
(gdb) p bytesRead
$9 = 0
Mike
I have tracked this problem down to a bug in Tcl 8.4.5. It probably
exists in 8.2 since I tried this release and ran into the same hang.
I fixed the code and ran my simulation overnight without any problems.
I'll test this with our application to see if there is any other
fallout.
What's the process for filing a bug to the Tcl development team?
I'll try to describe the bug and fix here as a heads up for others
experiencing hangs.
As I said our application starts an rsh on a remote system. It keeps
this shell alive for several days as it executes tests. The shell is
fed commands and returns output. The rsh never ends and never sends
an EOF. The size of data returned varies. The case causing the error
returned about 140,000 bytes. The more data the better chance of
getting the hang.
The problem, Tcl is doing a read when it shouldn't. The previous read
has returned all the data (including the prompt which is our logical
eof). Tcl copies as much data as will fit into the application buffer
and returns. The next trip into TclIO (expect recognizes that the
prompt has not been returned and drives this) the remaining bytes are
copied from the tcl buffer to the user buffer and then Tcl decides it
can do another read. This hangs our application. Logical eof has
already been returned but we haven't seen it yet due to buffering. I
don't think Tcl should do another read unless the application has seen
all the data and drives the request.
My fix went into TclIO.c function DoReadChars. Caution, I still need
to verify this with the larger application but it does prevent the
hang with my simulator which is a scaled down Tcl/Expect application.
Original code from DoReadChars around line 4416,
for (copied = 0; (unsigned) toRead > 0; ) {
copiedNow = -1;
if (statePtr->inQueueHead != NULL) {
if (encoding == NULL) {
copiedNow = ReadBytes(statePtr, objPtr, toRead,
&offset)
} else {
copiedNow = ReadChars(statePtr, objPtr, toRead,
&offset,
&factor);
}
}
Fixed code,
for (copied = 0; (unsigned) toRead > 0; ) {
copiedNow = -1;
if (statePtr->inQueueHead != NULL) {
if (encoding == NULL) {
copiedNow = ReadBytes(statePtr, objPtr, toRead,
&offset);
} else {
copiedNow = ReadChars(statePtr, objPtr, toRead,
&offset,
&factor);
>>> New if (copiedNow > 0) {
>>> statePtr->flags |= CHANNEL_BLOCKED;
>>> }
}
}
This basically prevents another read and allows a return to user code.
The user will know if another read is necessary.
I'll post another followup after I try this with our full application.
Mike
Mike Ellenberger wrote:
>
>
>
> I have tracked this problem down to a bug in Tcl 8.4.5. It probably
> exists in 8.2 since I tried this release and ran into the same hang.
> I fixed the code and ran my simulation overnight without any problems.
> I'll test this with our application to see if there is any other
> fallout.
>
> What's the process for filing a bug to the Tcl development team?
http://sourceforge.net/tracker/?atid=110894&group_id=10894&func=browse
http://tcl.sf.net is the main page for Tcl development, & there you
can get to bugs, patches, source code, ...
Bruce
Just a short note here because I haven't finished testing. After
re-building Tcl with this fix I ran the self test. Some new failures
are reported,
==== io-32.8 Tcl_Read, nonblocking, file FAILED
==== io-36.5 Tcl_InputBlocked vs files, short read, nonblocking FAILED
==== io-36.6 Tcl_InputBlocked vs files, event driven read FAILED
On another note I'm testing a circumvention to prevent the hang. It
involves setting match_max to 204800 (this being much larger than what
I perceive is the largest chunk of data we'll need to buffer) and then
updating our logic so we get every byte of data whenever we "expect"
any (using "*" and -re ".+"). Instead of using the nice parsing
features of expect all this has to be done "by hand". Still if more
data is returned than the 204800 byte buffer, there is a good
likelyhood that this will still hang.
Mike
For anyone interested, I set the match_max size to 57344 ( 4096*14 )
and was able to run my simulation for 15+ hours without a hang. I'm
retrieving about 150000 bytes. This was with nasty updated logic
which empties the buffers and does all the parsing "by hand".
Increasing match_max to 57344 ( 4096*14 ) and using the original logic
I was able to run my simulation overnight (14+ hours) without a hang.
This is obviously the path of least resistance.
I ran our test system against a re-built version of Tcl with "the fix"
outlined in an earlier post. The run completed with no errors.
match_max was at the default value (2000), our Tcl logic was
unchanged.
Mike