tac_send_authen: unexpected EOF from server

Birger

unread,

Sep 18, 2013, 9:44:14 AM9/18/13

to event-driv...@googlegroups.com

Hi Marc,

i made up a new thread because this issue has nothing to do with the "load / tac_plus not stopping issue", which has been successfully fixed with version DEVEL.201309162034.

After i applied the patch (diff4.txt) and running tac_plus for about 15 Minutes i got 5 messages in the logs of our juniper devices:
[crit] %AUTH-2: tac_send_authen: unexpected EOF from server

I set debug level of tac_plus to PACKET/AUTHEN and the corresponding log entries are (identical to log sequence, where everything is working fine !!):
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: New session
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: authen: hdr->seq_no: 1
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: looking for user scuser1 realm default
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: user lookup failed
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: Writing AUTHEN/GETPASS size=29
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: authen: hdr->seq_no: 3
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: looking for user scuser1 realm default
Sep 18 15:15:34 tac1 tac_plus[17058]: 140/9db813b8: user lookup failed
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: looking for user scuser1 realm default
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group scuser1, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group RW15_SCUSER, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: user lookup succeded
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group scuser1, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group RW15_SCUSER, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group scuser1, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group RW15_SCUSER, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group scuser1, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group RW15_SCUSER, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: 82.XXX.XXX.XXX: shell login for 'scuser1' from 81.X.XXX.XX succeeded
Sep 18 15:15:38 tac1 tac_plus[17058]: 82.XXX.XXX.XXX: shell login for 'scuser1' from 81.X.XXX.XX succeeded
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group scuser1, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group RW15_SCUSER, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group scuser1, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: cfg_get: checking user/group RW15_SCUSER, tag (NULL)
Sep 18 15:15:38 tac1 tac_plus[17058]: 140/9db813b8: Writing AUTHEN/PASS size=18

tcpdump on the tac_plus machine shows the following communication:

router     -> tac_plus    : TCP SYN
tac_plus   -> router      : TCP SYN, ACK
router     -> tac_plus    : TCP, ACK
router     -> tac_plus    : TACACS+ Q:Authentication // Login request with correct username
tac_plus   -> router : TCP, ACK
tac_plus   -> router : TACACS+ R:Authentication // (Send Password)
router     -> tac_plus    : TACACS+ Q:Authentication // submitting correct Pasword
tac_plus   -> router : TCP, ACK
<3 seconds>
router     -> tac_plus    : FIN, ACK
tac_plus   -> router : ACK
tac_plus   -> router : FIN, ACK
router     -> tac_plus    : ACK

Best regards,
Birger

Birger

unread,

Sep 18, 2013, 10:23:22 AM9/18/13

to event-driv...@googlegroups.com

Hi Marc,

just an amendment to my previous post: this happens only once in a while, in average 10-20 times per hour.
There is no particular high load when this happens. But the bad thing is, that the user cannot login, he has to make a second attempt.

I am very confused about this phenomenon.
Which OS do you use and recommend in production?

Best regards,
Birger

Marc Huber

unread,

Sep 18, 2013, 12:04:25 PM9/18/13

to event-driv...@googlegroups.com

Hi Birger,

On 18.09.13 16:23, Birger wrote:
> I am very confused about this phenomenon.

I think it's just timing related. A connection might be cleaned up by a
periodic garbage collection while there's still data available that
needs to be sent out. I've just uploaded DEVEL.201309181756.tar.bz2
which should/could/might fix that, unless I've missed yet another code
part where this could be relevant.

--- main.c 2012/10/06 07:41:01 1.267
+++ main.c 2013/09/18 15:54:51
@@ -179,7 +179,7 @@
{
rb_node_t *rbn, *rbnext;

- if ((ctx->timeout || ctx->dying)
+ if (!ctx->out && !ctx->delayed && (ctx->timeout || ctx->dying)
&& (ctx->last_io + ctx->timeout < io_now.tv_sec)) {
cleanup(ctx, ctx->sock);
return;

> Which OS do you use and recommend in production?

I really don't believe that this is an OS issue. Frankly, I prefer the
BSDs (no matter which), mostly because these run on a small footprint
and don't waste space with packages I don't need. And my development box
is a Mac, with an API that's pretty close to FreeBSD.

Cheers,

Marc

Birger

unread,

Sep 19, 2013, 6:58:08 AM9/19/13

to event-driv...@googlegroups.com, marc.j...@googlemail.com

Hi Marc,

sorry - but the problem persists. Again i've got 15 messages in about 1.5h ...

Best regards,
Birger

Marc Huber

unread,

Sep 19, 2013, 3:16:45 PM9/19/13

to event-driv...@googlegroups.com

Hi Birger,

On 19.09.13 12:58, Birger wrote:
> sorry - but the problem persists. Again i've got 15 messages in about
> 1.5h ...

I did have another look at your packet trace, and it looks like I was
wrong with my initial assumption that the daemon actually closes the
connection -- it obviously doesn't. Nevertheless, this did fix a
potential problem.

That being said, please retry with "setenv IO_POLL_MECHANISM = 8" at top
level. My event abstraction layer may be at fault, and I've never
trusted epoll() anyway.

Thanks,

Marc

Birger

unread,

Sep 20, 2013, 5:35:19 AM9/20/13

to event-driv...@googlegroups.com, marc.j...@googlemail.com

Hi Marc,

i set IO_POLL_MECHANISM = 8, but the errors occured again.

Best regards,
Birger

Marc Huber

unread,

Sep 21, 2013, 10:10:48 AM9/21/13

to event-driv...@googlegroups.com

Hi Birger,

On 20.09.13 11:35, Birger wrote:
> i set IO_POLL_MECHANISM = 8, but the errors occured again.

then I'm at a loss. I've currently no idea what could be causing that
problem, and I'm unable to reproduce it. I've done some extensive
testing, and haven't seen a single instance where the write_packet
function wasn't followed by a write() to the socket. I'll disable the
Nagle algorithm in the next snapshot, but I've serious doubts that this
will have any impact.

Cheers,

Marc

Birger

unread,

Sep 23, 2013, 4:03:28 AM9/23/13

to event-driv...@googlegroups.com, marc.j...@googlemail.com

Hi Marc,

thanks for all your efforts! I have a few questions:
1.) I saw the new version "DEVEL.201309221100.bz2" - is this already the version, where you disabled nagle algorithm?
2.) Is it possible to set the loglevel even more verbose (more than PACKET + AUTHEN)?
3.) Which BSD do you use in your production environment ( OpenBSD / FreeBDS / NetBSD )?

Best regards,
Birger

Marc Huber

unread,

Sep 23, 2013, 1:14:11 PM9/23/13

to event-driv...@googlegroups.com

Hi Birger,

On 23.09.13 10:03, Birger wrote:
> 1.) I saw the new version "DEVEL.201309221100.bz2" - is this already
> the version, where you disabled nagle algorithm?

yes, and there's another change that may or may not help (unused
connections are no longer removed directly but via the event loop).

> 2.) Is it possible to set the loglevel even more verbose (more than
> PACKET + AUTHEN)?

While there are other log levels (and "-1" covers everything) these
won't give you more relevant data for the particular issue. Attaching
dtruss/ktrace/strace/whatever-your-system-offers to a tacacs server
process would give more details (alas, more details than you want).

> 3.) Which BSD do you use in your production environment ( OpenBSD /
> FreeBDS / NetBSD )?

Development: MacOS (close enough to FreeBSD). Production: I don't have
personal control over production servers any more. Some are FreeBSD,
some are Debian Linux based. However, I could imagine that the main
reason for those systems not showing any relevant issues is related to
single-connection being configured. The software *should* be OS-agnostic.

Cheers,

Marc

Birger

unread,

Sep 24, 2013, 9:23:45 AM9/24/13

to event-driv...@googlegroups.com, marc.j...@googlemail.com

Hi Marc,

is there any possibility to turn on logging for the external auth script /usr/local/lib/mavis/mavis_tacplus_ldap.pl ?
I would like to be able to see errors for e.g timeouts. In /var/log/warn sometimes i can see tac_plus log entries, but nothing related to this ldap auth module.

Best regards,
Birger

Marc Huber

unread,

Sep 24, 2013, 12:49:48 PM9/24/13

to event-driv...@googlegroups.com

Hi Birger,

On 24.09.13 15:23, Birger wrote:
> is there any possibility to turn on logging for the external auth
> script /usr/local/lib/mavis/mavis_tacplus_ldap.pl ?

nope, that's not implemented. I seem to remember that whatever the Perl
script writes to stderr makes its way to syslog, but I may as well
remember this wrong.

Making some Perl script more verbose would of course be easy. However,
there's the question of what's actually worth to be logged.

Cheers,

Marc

Birger

unread,

Sep 25, 2013, 9:42:36 AM9/25/13

to event-driv...@googlegroups.com, marc.j...@googlemail.com

Hi Marc,

i wanted to test the single-connection feature.
In the documentation (4.3.2.3), the config statement is : "single-connect ...".
I always got th error "Unrecognized token 'single-connect'"

After changing the config statement to "single-connection ...", it worked.
Did i miss something or is this a mistake in the docs?

Best regards,
Birger

Marc Huber

unread,

Sep 25, 2013, 12:06:08 PM9/25/13

to event-driv...@googlegroups.com

Hi Birger,

On 25.09.13 15:42, Birger wrote:
> I always got th error "Unrecognized token 'single-connect'"
>
> After changing the config statement to "single-connection ...", it
> worked.
> Did i miss something or is this a mistake in the docs?

documentation glitch. Will be fixed in the next snapshot.

Thanks,

Marc

Birger

unread,

Sep 27, 2013, 12:48:25 PM9/27/13

to event-driv...@googlegroups.com, marc.j...@googlemail.com

Hi Marc,

the problem seems to be fixed, we are now waiting for a few days to verify this.
As far as we can see, it was caused by some network virtualization layer in our virtualization environment.

We are now running the primary systems (LDAP + tac_plus server) on dedicated hardware machines. So far, no such error occurred anymore.

Thanks a lot for your efforts! I am very glad, that the "load problem" was solved!!

Best regards,
Birger

Reply all

Reply to author

Forward