write failure on SUSE-11 leads to crash in tcl 8.5.9

Vikki

Jul 9, 2012, 3:55:40 AM

On SUSE-11 machines I am getting a write error from inside Tcl.
Unfortunately this causes my application to crash, because special
handling introduced in Tcl 8.5.9 makes a failed write call the
Tcl_Panic function.

The following link shows the Tcl commit that is causing this issue:
http://www.rkeene.org/projects/tcl/tcl.fossil/fdiff?v1=51b7f49162c54e25&v2=3c2ab6ccbd85bfea

Has anybody faced this issue before?
How safe is it to change the Tcl internal code to ignore this write
error?
What could be the possible reasons for getting a write error? I do
not get it on RHEL-4.
Can I just change some options in SUSE to make it work like RHEL?

Alexandre Ferrieux

Jul 9, 2012, 4:36:33 AM

On Jul 9, 9:55 am, Vikki <harish.t...@gmail.com> wrote:
> On SUSE-11 machines I am getting one write error from tcl internal.
> And unfortunatly this is causing my application to crash. As due to
> special handling introduced in tcl 8.5.9 a write fail, calls a TCL
> Panic function.
>
> Following is the link for the report of the tcl commit that is causing
> this issue. http://www.rkeene.org/projects/tcl/tcl.fossil/fdiff?v1=51b7f49162c54e...
>
> Has anybody faced this issue earlier?
> How safe is it to change the TCL internal code to ignore this write
> error?
> What could be the possible reasons of getting a write error? As I do
> not get the same on redhat 4.
> Can I just change some options in SUSE to make it work like RHEL ?

You already asked one week ago:

https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/bb5041da00369ec7?hl=en#

Why not follow the advice given there (strace output), and reply here
on the newsgroup (not by e-mail, as I already told you)?

-Alex

Vikki

Jul 9, 2012, 5:29:01 AM

On Jul 9, 1:36 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> On Jul 9, 9:55 am, Vikki <harish.t...@gmail.com> wrote:
>
> > On SUSE-11 machines I am getting one write error from tcl internal.
> > And unfortunatly this is causing my application to crash. As due to
> > special handling introduced in tcl 8.5.9 a write fail, calls a TCL
> > Panic function.
>
> > Following is the link for the report of the tcl commit that is causing
> > this issue.http://www.rkeene.org/projects/tcl/tcl.fossil/fdiff?v1=51b7f49162c54e...
>
> > Has anybody faced this issue earlier?
> > How safe is it to change the TCL internal code to ignore this write
> > error?
> > What could be the possible reasons of getting a write error? As I do
> > not get the same on redhat 4.
> > Can I just change some options in SUSE to make it work like RHEL ?
>
> You already asked one week ago:
>
> https://groups.google.com/group/comp.lang.tcl/browse_frm/thread/bb504...
>
> why not follow the advice given there (strace output), and reply here
> on the newsgroup (not by e-mail, as I already told you) ?
>
> -Alex

Hi Alex,

I sent you the mail because I was not able to post the strace output
here; I guess there is a limit on how much one can post.
I am posting a limited excerpt of the strace and lsof output here.

STRACE OUTPUT
9956 futex(0x88dea0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
9725 futex(0x26d1f14, 0x189 /* FUTEX_??? */, 1762223, {1341825405, 649556000}, ffffffff <unfinished ...>
9956 <... futex resumed> ) = 0
9725 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
9956 select(17, [3 5 16], [], [], NULL <unfinished ...>
9725 write(4, "\0", 1 <unfinished ...>
9956 <... select resumed> ) = 1 (in [3])
9725 <... write resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
9956 futex(0x88dea0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
9725 write(2, "Tcl_WaitForEvent: unable to writ"..., 48) = 48

Output of > ll /proc/9972/fd
lrwx------ 1 hbansal abc 64 2012-07-09 14:53 5 -> socket:[262666404]
l-wx------ 1 hbansal abc 64 2012-07-09 14:53 4 -> pipe:[262666384]
lr-x------ 1 hbansal abc 64 2012-07-09 14:53 3 -> pipe:[262666384]

Output of > lsof | grep 262666384
vish 9972 hbansal 3r FIFO 0,8 0t0 262666384 pipe
vish 9972 hbansal 4w FIFO 0,8 0t0 262666384 pipe

I can send you the complete strace output on your mail id if needed.

My questions remain the same.

How safe is it to change the TCL internal code to ignore this write
error?
What could be the possible reasons of getting a write error? As I do
not get the same on redhat 4.
Can I just change some options in SUSE to make it work like RHEL ?

Please help me out; this problem is really bugging me a lot now.

Thanks
Harish Bansal

Alexandre Ferrieux

Jul 9, 2012, 11:24:52 AM

Good, posting a short excerpt is the way to go. Too bad you forgot two
options (-tt and -T), so there's no timing info there...

> 9725 write(4, "\0", 1 <unfinished ...>
> 9956 <... select resumed> ) = 1 (in [3])
> 9725 <... write resumed> ) = -1 EAGAIN (Resource
> temporarily unavailable)

> I can send you the complete strace output on your mail id if needed.

You already did, and there, the -tt -T options were not forgotten ;-)

After a bit of reformatting (please use attachments, not text pasting,
those line breaks are a PAIN) and filtering, I extracted the following
from your trace. It is only the read/writes on the notifier pipe.

8633 11:24:47.624788 write(4, "\0", 1 <unfinished ...>
8633 11:24:47.624803 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.624865 read(3, <unfinished ...>
8633 11:24:47.624871 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.624878 <... read resumed> "\0", 1) = 1 <0.000009>
8633 11:24:47.624885 <... write resumed> ) = 1 <0.000008>
8673 11:24:47.624974 read(3, <unfinished ...>
8633 11:24:47.624980 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.624987 <... read resumed> "\0", 1) = 1 <0.000009>
8633 11:24:47.624994 <... write resumed> ) = 1 <0.000009>
8633 11:24:47.625068 write(4, "\0", 1 <unfinished ...>
8633 11:24:47.625083 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.625142 read(3, <unfinished ...>
8673 11:24:47.625154 <... read resumed> "\0", 1) = 1 <0.000008>
8633 11:24:47.625177 write(4, "\0", 1 <unfinished ...>
8633 11:24:47.625192 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.625254 read(3, <unfinished ...>
8633 11:24:47.625260 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.625267 <... read resumed> "\0", 1) = 1 <0.000009>
8633 11:24:47.625274 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.625363 read(3, <unfinished ...>
8633 11:24:47.625369 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.625375 <... read resumed> "\0", 1) = 1 <0.000008>
8633 11:24:47.625383 <... write resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>

As you can see, even in the short span of this snapshot (600 microseconds),
the read side is slower than the write side
(7 writes, 5 reads).
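The tally above can be reproduced mechanically. What follows is only a minimal sketch, assuming the excerpt is pasted in with its wrapped EAGAIN line rejoined; the helper name count_completed is made up for illustration. It counts the "resumed" lines, i.e. the calls that actually returned.

```python
import re

# Notifier-pipe excerpt from the trace above (wrapped last line rejoined).
TRACE = r"""
8633 11:24:47.624788 write(4, "\0", 1 <unfinished ...>
8633 11:24:47.624803 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.624865 read(3, <unfinished ...>
8633 11:24:47.624871 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.624878 <... read resumed> "\0", 1) = 1 <0.000009>
8633 11:24:47.624885 <... write resumed> ) = 1 <0.000008>
8673 11:24:47.624974 read(3, <unfinished ...>
8633 11:24:47.624980 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.624987 <... read resumed> "\0", 1) = 1 <0.000009>
8633 11:24:47.624994 <... write resumed> ) = 1 <0.000009>
8633 11:24:47.625068 write(4, "\0", 1 <unfinished ...>
8633 11:24:47.625083 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.625142 read(3, <unfinished ...>
8673 11:24:47.625154 <... read resumed> "\0", 1) = 1 <0.000008>
8633 11:24:47.625177 write(4, "\0", 1 <unfinished ...>
8633 11:24:47.625192 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.625254 read(3, <unfinished ...>
8633 11:24:47.625260 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.625267 <... read resumed> "\0", 1) = 1 <0.000009>
8633 11:24:47.625274 <... write resumed> ) = 1 <0.000009>
8673 11:24:47.625363 read(3, <unfinished ...>
8633 11:24:47.625369 write(4, "\0", 1 <unfinished ...>
8673 11:24:47.625375 <... read resumed> "\0", 1) = 1 <0.000008>
8633 11:24:47.625383 <... write resumed> ) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
""".strip()


def count_completed(trace):
    """Count returned write/read calls, and how many of them hit EAGAIN."""
    counts = {"write": 0, "read": 0, "eagain": 0}
    for line in trace.splitlines():
        match = re.search(r"<\.\.\. (read|write) resumed>", line)
        if match is None:
            continue  # "<unfinished ...>" lines only mark the start of a call
        counts[match.group(1)] += 1
        if "EAGAIN" in line:
            counts["eagain"] += 1
    return counts


counts = count_completed(TRACE)
print(counts)
```

Note that the seventh write is the one that fails, so within this window six bytes went in and five came out.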

To confirm this analysis, a strace from the beginning (restricted to
reads and writes: strace -tt -T -f -o file.tra -e
trace=read,write ...) would show that the "drift" between reads and
writes reaches one pipe buffer (typically one page, i.e. 4k), hence the
EAGAIN. It would be nice if you could verify this.
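The overrun mechanism itself is easy to reproduce outside Tcl. The following is a rough sketch, not Tcl's code: it assumes a POSIX system and Python 3.5 or later (for os.set_blocking), and the function name is invented. It writes single null bytes into a non-blocking pipe that nobody reads until the kernel refuses one. The capacity it reports depends on the OS and kernel, so no particular number should be expected.

```python
import errno
import os


def fill_pipe_with_wakeups():
    """Write null bytes to an unread non-blocking pipe until a write fails.

    Returns (bytes_accepted, errno_of_the_failed_write).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # same idea as a non-blocking trigger pipe
    accepted = 0
    try:
        while True:
            try:
                accepted += os.write(write_fd, b"\0")
            except BlockingIOError as exc:
                # The pipe buffer is full: this is the EAGAIN seen in the trace.
                return accepted, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


accepted, failure = fill_pipe_with_wakeups()
print("bytes accepted before the failure:", accepted)
print("failed with EAGAIN:", failure == errno.EAGAIN)
```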

The next questions, then, are:

(1) what circumstances lead to such an overrun (the core sending
config updates to the notifier thread, at a faster pace than
reasonably sustainable)?

(2) why doesn't the core use blocking writes here?

Could you please elaborate on your app to investigate (1) ?
I will pursue (2) with other core maintainers.

-Alex

Alexandre Ferrieux

Jul 9, 2012, 12:13:52 PM

On Jul 9, 5:24 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> On Jul 9, 11:29 am, Vikki <harish.t...@gmail.com> wrote:
>
> > > > On SUSE-11 machines I am getting one write error from tcl internal.
> > > > And unfortunatly this is causing my application to crash. As due to
> > > > special handling introduced in tcl 8.5.9 a write fail, calls a TCL
> > > > Panic function.
>
> > > > Following is the link for the report of the tcl commit that is causing
> > > > this issue.http://www.rkeene.org/projects/tcl/tcl.fossil/fdiff?v1=51b7f49162c54e...
>
> > > > Has anybody faced this issue earlier?
> > > > How safe is it to change the TCL internal code to ignore this write
> > > > error?

> (2) why doesn't the core use blocking writes here.

A cursory scan of the commit you mentioned... leaves me uneasy :P

To me, since the triggerPipe is documented to be protected by
notifierMutex, it sounds reasonable to write to it in non-blocking mode
for the "wake-up" case (writing a single null byte): when an unending
stream of updates comes, several updates can share a single "wake-up",
so losing one of them in an overrun is no issue. This validates this
move from 1999:

1999-10-20 Jeff Hobbs <ho...@scriptics.com>

* unix/tclUnixNotfy.c: fixed event/io threading problems by making
triggerPipe non-blocking. [Bug 2792]

As a consequence, as your example shows, if an overrun occurs,
ignoring a write() that fails with EAGAIN should be harmless. So I think
the 2009-12-16 commit is too heavy-handed, since it moves from "harmless"
to "Tcl_Panic".
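The reasoning can be sketched as a wake-up pipe whose writer drops a wake-up when the pipe is already full. This is an illustration of the pattern only, not the code in unix/tclUnixNotfy.c; the names make_trigger_pipe, wake_up and drain are hypothetical, and a POSIX system with Python 3.5 or later is assumed.

```python
import os


def make_trigger_pipe():
    """Create a pipe with both ends non-blocking; returns (read_fd, write_fd)."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(read_fd, False)
    os.set_blocking(write_fd, False)
    return read_fd, write_fd


def wake_up(write_fd):
    """Post one wake-up byte. Returns False if it was dropped (pipe full)."""
    try:
        os.write(write_fd, b"\0")
        return True
    except BlockingIOError:
        # EAGAIN: wake-up bytes are already queued, so the reader will wake
        # up anyway and dropping this one loses nothing.
        return False


def drain(read_fd):
    """Consume every pending wake-up byte; returns how many were read."""
    total = 0
    while True:
        try:
            chunk = os.read(read_fd, 4096)
        except BlockingIOError:
            return total  # nothing left to read
        if not chunk:
            return total  # write end closed
        total += len(chunk)


# Overrun the pipe on purpose, then show that nothing breaks.
read_fd, write_fd = make_trigger_pipe()
posted = 0
while wake_up(write_fd):
    posted += 1
dropped_one = not wake_up(write_fd)   # pipe is full: dropped, no error raised
drained = drain(read_fd)              # the reader catches up
works_again = wake_up(write_fd)       # wake-ups are accepted again
os.close(read_fd)
os.close(write_fd)
```

The only design decision here is in wake_up: the failed write is treated as a normal outcome rather than an error, which is the behaviour argued for above.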

The commit comment is not really helpful:

File unix/tclUnixNotfy.c
2009-12-16 23:25:59 - part of checkin [e52afe7fa4] on branch trunk -
Fix gcc warning: ignoring return value of ‘write’, declared with
attribute warn_unused_result. CONSTify functions
TclpGetUserHome and TclSetPreInitScript (TIP #27) (user: nijtmans)

I have created ticket 3541646 to request advice on this:

https://sourceforge.net/tracker/?func=detail&aid=3541646&group_id=10894&atid=110894

-Alex

haris...@gmail.com

Jul 10, 2012, 12:37:13 AM

On Monday, 9 July 2012 21:43:52 UTC+5:30, Alexandre Ferrieux wrote:
> On Jul 9, 5:24 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
> wrote:
> > On Jul 9, 11:29 am, Vikki <harish.t...@gmail.com> wrote:
> >
> > > > > On SUSE-11 machines I am getting one write error from tcl internal.
> > > > > And unfortunatly this is causing my application to crash. As due to
> > > > > special handling introduced in tcl 8.5.9 a write fail, calls a TCL
> > > > > Panic function.
> >
> > > > > Following is the link for the report of the tcl commit that is causing
> > > > > this issue.http://www.rkeene.org/projects/tcl/tcl.fossil/fdiff?v1=51b7f49162c54e...
> >
> > > > > Has anybody faced this issue earlier?
> > > > > How safe is it to change the TCL internal code to ignore this write
> > > > > error?
>
> > (2) why doesn't the core use blocking writes here.
>
> A cursory scan of the commit you mentioned... leaves me uneasy :P
>
> To me, since the triggerPipe is documented to be protected by
> notifierMutex, it sounds reasonable to write to it in nonblockingmode
> for the "wake-up" mode (writing a single null byte): when an unending
> stream of updates comes, several updates can share a single "wake-up",
> so losing one of them in an overrun is no issue. This validates this
> move from 1999:
>
> 1999-10-20 Jeff Hobbs <ho...@scriptics.com>
>
> * unix/tclUnixNotfy.c: fixed event/io threading problems by making
> triggerPipe non-blocking. [Bug 2792]
>
> As a consequence, as your example shows, if an overrun occurs,
> ignoring the write()==EAGAIN should be harmless. So I think the
> 2009-12-16 commit is too heavy handed, since it moves from "harmless"
> to "Tcl_Panic".
>
> The commit comment is not really helpful:
>
> File unix/tclUnixNotfy.c
> 2009-12-16 23:25:59 - part of checkin [e52afe7fa4] on branch trunk -
> Fix gcc warning: ignoring return value of ‘write’, declared with
> attribute warn_unused_result CONSTify functions
> TclpGetUserHome and TclSetPreInitScript (TIP #27) (user: nijtmans )
> [annotate]
>
> I have created ticket 3541646 to request advice on this:
>
> https://sourceforge.net/tracker/?func=detail&aid=3541646&group_id=10894&atid=110894
>
> -Alex
----------------------------------------------------------------------
Thanks Alex, this will help a lot.

Apart from running my application on SUSE-11 I am not doing anything special. We usually run it on RHEL-4, and we have never faced this problem there.

Anyway, I'll look into (1): what is causing the writes to be faster than usual or the reads to be slower than usual.
Any pointers on what makes the core send updates to the notifier thread faster or slower? The usual update calls?

- Harish Bansal

Alexandre Ferrieux

Jul 10, 2012, 6:49:31 AM

> > https://sourceforge.net/tracker/?func=detail&aid=3541646&grou...
>
> > -Alex
>
> ----------------------------------------------------------------------
> Thanks Alex, this will help alot.
>
> Apart from running my application on SUSE-11 I am not doing anything special as usually we run it on RHEL-4 and we haven't faced this problem on RHEL-4 ever.
>
> Anyways, I 'll check for the (1) that what is causing the write to be faster than usual or the read to be slower than usual.
> Any pointer on what makes the core sending updates to notifier thread faster or slower? Usual update calls?

No, it is typically [fileevent] or [close] calls in short succession.

-Alex

quiet_lad

Jul 10, 2012, 1:21:27 PM

On Jul 10, 3:49 am, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

Don't use SUSE.

If Linux, then Arch Linux.

Harald Oehlmann

Jul 11, 2012, 2:20:26 AM

On Jul 10, 7:21 pm, quiet_lad <gavcom...@gmail.com> wrote:
> dont use suse

As Reinhard Max, who contributed the IPv6 code for Tcl, is a SuSE
employee and the maintainer, we have the pleasure of always having
very recent and good-quality Tcl packages for SuSE.
You may consider this for your Linux choice too.

And, by the way, the Rivet package for SuSE is also the best compared
to all other Linux flavours.

Enjoy,
Harald

Alexandre Ferrieux

Jul 11, 2012, 10:18:33 AM

On Jul 9, 6:13 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

> > > > > > How safe is it to change the TCL internal code to ignore this write
> > > > > > error?
>
> A cursory scan of the commit you mentioned... leaves me uneasy :P
>
> I have created ticket 3541646 to request advice on this:
>
> https://sourceforge.net/tracker/?func=detail&aid=3541646&group_id=108...

Update: Jan Nijtmans reverted that move yesterday in both the 8.5 and
8.6 branches.
As a consequence, 8.5.12, which is just about to ship, will contain the
fix.

-Alex

haris...@gmail.com

Jul 11, 2012, 1:51:26 PM

On Wednesday, 11 July 2012 19:48:33 UTC+5:30, Alexandre Ferrieux wrote:
> On Jul 9, 6:13 pm, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
> wrote:
> > > > > > > How safe is it to change the TCL internal code to ignore this write
> > > > > > error?
> >
> > A cursory scan of the commit you mentioned... leaves me uneasy :P
> > I have created ticket 3541646 to request advice on this:
> >
> > https://sourceforge.net/tracker/?func=detail&aid=3541646&group_id=108...
>
> Update: Jan Nijtmans reverted that move yesterday in both 8.5 and 8.6
> branches.
> As a consequence, 8.5.12 which is just about to ship, will contain the
> fix.
>
> -Alex

Thanks Alex,
I have tested this fix by merging it into 8.5.9, and it is running well. :)
Thanks for your help.

- Harish Bansal