[9fans] broken smtpd

Kenji Arisawa

unread,

Nov 5, 2008, 11:31:26 PM11/5/08

to Fans of the OS Plan 9 from Bell Labs

Hello,

I sometimes observe broken smtpd.

ar% ps
...
none 1948578 0:01 44:55 264K Broken smtpd
none 2521172 0:01 44:58 264K Broken smtpd
none 3190037 0:01 45:03 264K Broken smtpd
...

ar% ACID 3190037
/proc/3190037/text:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: lstk()
errstr()+0x7 /sys/src/libc/9syscall/errstr.s:5
syslog(logname=0x3760a,cons=0x0,fmt=0x37610)+0x20 /sys/src/libc/9sys/
syslog.c:46
err=0x616d2f27
d=0x0
ctim=0x0
buf=0x0
p=0x0
arg=0x0
n=0x0
catchalarm(msg=0xdfffe1ac)+0x5d /sys/src/cmd/upas/smtp/smtpd.c:68
rv=0x1
notifier+0x30 /sys/src/libc/port/atnotify.c:15
acid:

ar% pr -tn /sys/src/libc/9sys/syslog.c
...
34 void
35 syslog(int cons, char *logname, char *fmt, ...)
36 {
37 char buf[1024];
38 char *ctim, *p;
39 va_list arg;
40 int n;
41 Dir *d;
42 char err[ERRMAX];
43
44 err[0] = ' 0';
45 errstr(err, sizeof err);
46 lock(&sl);
47
...

Is this phenomenon specific to me?

Kenji Arisawa

Richard Miller

unread,

Nov 6, 2008, 5:33:08 AM11/6/08

to

> Is this phenomenon specific to me?

No, it isn't:

term% ps|grep Broken
none 917 2:08 42:50 264K Broken smtpd

erik quanstrom

unread,

Nov 6, 2008, 8:12:35 AM11/6/08

to

impressive!

i don't see this problem, but i have implemented
some fairly agressive spam control. perhaps this
shows up after i tend to hang up.

(it is possible to use nupas smtpd with regular old
upas.)

would anyone mind sending me a full backtrace?

- erik

Kenji Arisawa

unread,

Nov 17, 2008, 7:07:42 AM11/17/08

to

Hello,

Today, I observed smtpd that was in endless loop and then broken.

ar% ps
...
none 20392 0:01 28:27 264K Running smtpd
...

ar% ACID 20392
/proc/20392/text:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: lstk()
receiver(path=0x41f58)+0x1c5 /sys/src/cmd/upas/smtp/smtpd.c:625
sender=0x379bf
rcpt=0x41ad8
zzparse()+0x2c9 /sys/src/cmd/upas/smtp/smtpd.y:773
save1=0x0
save2=0x0
save3=0x0
save4=0x0
zzstate=0xf
zzchar=0xffffffff
zzs=0x3f36b
zzp=0xdfffe41c
zzn=0xfffffc18
main(argv=0xdfffefa4,argc=0x0)+0x261 /sys/src/cmd/upas/smtp/smtpd.c:194
netdir=0xdfffefbc
_argc=0x6e
_args=0x376c2
buf=0x0
_main+0x31 /sys/src/libc/386/main9.s:16
acid: src(receiver+0x1c5)
/sys/src/cmd/upas/smtp/smtpd.c:625
620
621 logged = 0;
622
623 /* forwarding() can modify 'path' on loopback request */
624 if(filterstate == ACCEPT && fflag && !authenticated &&
forwarding(path)) {
>625 syslog(0, "smtpd", "Bad Forward %s (%s/%s) (%s)",
626 s_to_c(senders.last->p), him, nci->rsys, s_to_c(path));
627 rejectcount++;
628 reply("550 5.7.1 we don't relay. send to your-path@[] for "
629 "loopback.\r\n");
630 return;

I don't observe "Bad Forward" of this mail in /sys/log/smtpd.

After some amount of time,

ar% ps
...
none 20392 0:02 44:57 264K Broken smtpd
...

acid: lstk()
errstr()+0x7 /sys/src/libc/9syscall/errstr.s:5
syslog(logname=0x3760a,cons=0x0,fmt=0x37610)+0x20 /sys/src/libc/9sys/
syslog.c:46
err=0x616d2f27
d=0x0
ctim=0x0
buf=0x0
p=0x0
arg=0x0
n=0x0
catchalarm(msg=0xdfffe1ac)+0x5d /sys/src/cmd/upas/smtp/smtpd.c:68
rv=0x1
notifier+0x30 /sys/src/libc/port/atnotify.c:15

acid: src(syslog+0x29)
/sys/src/libc/9sys/syslog.c:46

41 Dir *d;
42 char err[ERRMAX];
43
44 err[0] = '\0';
45 errstr(err, sizeof err);
>46 lock(&sl);
47

48 /*
49 * paranoia makes us stat to make sure a fork+close
50 * hasn't broken our fd's
51 */
acid:

I hope the above information is helpful to fix the problem.

Kenji Arisawa

erik quanstrom

unread,

Nov 17, 2008, 8:02:34 AM11/17/08

to

d'oh! the return value from catchalarm looks reversed. from notify(2)

[...] A handler must
return a non-zero number if the note was recognized (and
resolved); otherwise it must return zero. When the system

i think you're getting into some sort of note loop. i think it would
be helpful to log all notes. i would try something like this

static int
catchalarm(void *, char *msg)
{
static int chattycathy;

if(chattycathy++ > 5)
return 0;
if(senders.first && rcvers.first)
syslog(0, "smtpd", "note: %s->%s: %s",
s_to_c(senders.first->p),
s_to_c(rcvers.first->p), msg);
else
syslog(0, "smtpd", "note: %s", msg);
if(pp){
syskillpg(pp->pid);
pp = 0;
}
return strstr(msg, "alarm") != 0;
}

but at a minimum, i would reverse the return values.

- erik

Kernel Panic

unread,

Nov 17, 2008, 8:35:38 AM11/17/08

to

erik quanstrom wrote:
> d'oh! the return value from catchalarm looks reversed. from notify(2)
>
> [...] A handler must
> return a non-zero number if the note was recognized (and
> resolved); otherwise it must return zero. When the system
>
> i think you're getting into some sort of note loop. i think it would
> be helpful to log all notes. i would try something like this
>

its not a good idea to call syslog() in a note handler, because it
aquires locks
and calls into the memory allocator by strdup() (that too aquires locks).

if the note interrupts the program while in syslog() or the allocator,
it will deadlock.

--
cinap

erik quanstrom

unread,

Nov 17, 2008, 8:49:05 AM11/17/08

to

> its not a good idea to call syslog() in a note handler, because it
> aquires locks
> and calls into the memory allocator by strdup() (that too aquires locks).
>
> if the note interrupts the program while in syslog() or the allocator,
> it will deadlock.

excellent point. in this case, i think it is more likely that the problem
is more mundane than that. in any event, we are chewing gobs
of cpu time, so we don't have a literal deadlock.

- erik

Charles Forsyth

unread,

Nov 17, 2008, 8:53:33 AM11/17/08

to

if you look at /dev/kmesg you might find many messages about "checked N page table entries".
i think smtpd is getting a trap (probably a null pointer) and because the value returned
from catchalarm is the wrong way round, it continues to retry using the bad pointer until
the alarm goes off (after a long time), when it breaks (because no note handler has
handled the alarm).

i didn't really understand why it was trying to continue after the alarm,
rather than logging and quitting. does it not always mean that a caller
has connected but hasn't completed a request after a long time?
as cinap noted(!) there's a separate problem of what happens if the trap or alarm
happens when locks are held, but i think that's less likely to be a problem in practice:
an alarm will probably go off when nothing much has happened (ie, the program isn't
holding any other lock but waiting for input). it's harder to be sure about
bad pointer traps, but if those happen with locks held,
i'd suspect those are more likely to be locks unrelated to those needed by syslog and print.

Kenji Arisawa

unread,

Nov 18, 2008, 7:28:46 PM11/18/08

to

Hello,

I added some debugging codes to catchalarm() in smtpd.c

static int
catchalarm(void *a, char *msg)
{
int rv = 1;
static int count = 0; // DEBUG by Kenar
count++;
if(count > 3)
return 0;

USED(a);

/* log alarms but continue */
if(strstr(msg, "alarm")){

if(senders.first && rcvers.first)
syslog(0, "smtpd", "note: %s->%s: %s",
s_to_c(senders.first->p),
s_to_c(rcvers.first->p), msg);
else
syslog(0, "smtpd", "note: %s", msg);

rv = 0;
}

syslog(0, "smtpd", "note: %s", msg); // DEBUG by Kenar

/* kill the children if there are any */
if(pp)
syskillpg(pp->pid);

return rv;
}

after that I found broken smtpd:
none 50392 0:00 0:00 264K Broken smtpd

note that wasted cpu time is now 0:00

/sys/log/smtpd shows:
ar Nov 19 02:23:05 ehlo from 124.8.67.36 as sotcndhz.com
ar Nov 19 02:23:08 note: sys: trap: fault read addr=0x4 pc=0x00007463
ar Nov 19 02:23:08 note: sys: trap: fault read addr=0x4 pc=0x00007463
ar Nov 19 02:23:08 note: sys: trap: fault read addr=0x4 pc=0x00007463

Kenji Arisawa

erik quanstrom

unread,

Nov 18, 2008, 7:53:28 PM11/18/08

to

acid tells us that we're indirecting a nil senders.last->p.
we know it's senders.last->p because it's at offset 4.
(that was the fault address.) the Lock at the beginning
of String is bigger than 4 bytes by itself, so it's not the
String and rsys is not the 2d pointer in NetConnInfo,
so that eliminates rsys.

acid: src(0x7463)
/sys/src/cmd/upas/smtp/smtpd.c:628
623 logged = 0;
624
625 /* forwarding() can modify 'path' on loopback request */
626 if(filterstate == ACCEPT && fflag && !authenticated && forwarding(path)) {
627 syslog(0, "smtpd", "Bad Forward %s (%s/%s) (%s)",
>628 s_to_c(senders.last->p), him, nci->rsys, s_to_c(path));
629 rejectcount++;
630 reply("550 5.7.1 we don't relay. send to your-path@[] for "
631 "loopback.\r\n");
632 return;
633 }

replace lines 627,628 with

syslog(0, "smtpd", "Bad Forward %s (%s/%s) (%s)",

sender, him, nci->rsys, rcpt);

i made this change some time ago and it appears to be working fine.
i may have seen a similar fault.

- erik