High CPU Usage

167 views
Skip to first unread message

Brett Barton

unread,
Mar 21, 2012, 8:20:37 AM3/21/12
to Event-Driven Servers
As I mentioned in a previous thread sometime tac_plus consumes the
entire CPU. It is currently doing that now.

Here is the output of PS:

root 10190 0.0 0.1 3924 712 ? Ss Mar14 0:13
tac_plus: 5 connections, accepting up to 595
more
tacacs 12810 97.2 0.6 6120 3380 ? Rs Mar16 6546:25
tac_plus: 5 connections
tacacs 18772 0.0 0.3 4644 1896 ? Ss 08:03 0:00
tac_plus: 0 connections

Relevant top output:

12810 tacacs 20 0 6120 3380 1316 R 99.7 0.7 6547:14 tac_plus

Currently running Debian 5.0.7, here is uname output:

Linux 2.6.26-2-686 #1 SMP Thu Nov 25 01:53:57 UTC 2010 i686 GNU/Linux

The Debian system is running on VMware 4.1.0 and the VM specs are:

Memory: 512 MB
CPUs: 1
SCSI Controller 0: LSI LogicParallel

What other info can I provide to be helpful?

Marc Huber

unread,
Mar 21, 2012, 3:24:39 PM3/21/12
to event-driv...@googlegroups.com
Hi Brett,


On Wednesday, March 21, 2012 1:20:37 PM UTC+1, Brett Barton wrote:
As I mentioned in a previous thread sometime tac_plus consumes the
entire CPU. It is currently doing that now.

I'll try to reproduce that over the weekend.

As it might be a Linux specific problem in my io abtraction layer, you could try switching to the "poll" code (instead of "epoll", which is the default for Linux), by adding

setenv IO_POLL_MECHANISM=8

at the top of your configuration file.

Thanks,

Marc



Message has been deleted

Rodrigo Pescador

unread,
Mar 22, 2012, 9:17:33 AM3/22/12
to Event-Driven Servers
Hi Marc,

I'm experiencing the same problem. I have two tacacs+ servers that
work in redundancy (mirror) and suddenly they consume the entire CPU.
It happened twice since I've installed them six months ago.

Nowadays, my environment control about 800 devices (Cisco, D-Link,
3COM...).

The symptoms are the same that Barton reported.

Next time it happens, I'll try to get more informations. What kind of
informations would be useful?

My servers configuration:
- Debian 6.0.2 - AMD64
- 2 CPU
- 2048MB RAM
- tac_plus 201109170923

Thanks in advance!

Rodrigo.

Marc Huber

unread,
Mar 22, 2012, 4:02:06 PM3/22/12
to Event-Driven Servers
Hi Rodrigo,

On 22 Mrz., 14:17, Rodrigo Pescador <rope...@gmail.com> wrote:
> Hi Marc,
>
> I'm experiencing the same problem. I have two tacacs+ servers that
> work in redundancy (mirror) and suddenly they consume the entire CPU.
> It happened twice since I've installed them six months ago.

you could try

setenv IO_POLL_MECHANISM=8

at top-level. Chances are that this is a bug in either my epoll code
(or in the kernel implementation). Setting the IO_POLL_MECHANISM value
to 8 will enable stock poll(2) code, which should/could be more
stable.

The major problem with this bug is that it doesn't look like I could
easily reproduce it in a test setup.

Thanks,

Marc

Brett Barton

unread,
Mar 23, 2012, 1:53:33 PM3/23/12
to Event-Driven Servers
Alright, I have added that line to the very top of my cofiguration
file. TACACS+ loaded up just fine, I will report back if/when it
exhibits the same behavior. Also if it does seem to fix it, I will
report that too!

Thanks for the quick replies!

Rodrigo Pescador

unread,
Mar 24, 2012, 9:00:25 AM3/24/12
to Event-Driven Servers
Hi Marc,

Ok, I'll try to add this line in my configuration.

Thank you!

Luis Fernando Cordeiro

unread,
Mar 26, 2012, 11:05:41 AM3/26/12
to event-driv...@googlegroups.com
Hi Marc,

I'm Rodrigo's co-worker. We tried the changes that you suggested but when I try to restart the deamon I get an error. 

Our current snapshot is dated from 201109170923, should we update it?

I got the following error:

/etc/init.d/tac_plus start
setenv 
/usr/local/etc/tac_plus.cfg:1: Expected 'id', 'include', 'debug', 'syslog', 'proctitle', 'gcore' or 'debug-cmd', but got 'setenv'
********************************************************************************
* Unable to start tac_plus ... please fix the configuration problem
* indicated above.
********************************************************************************

Thanks in advance! 

Luís

Marc Huber

unread,
Mar 26, 2012, 11:43:13 AM3/26/12
to Event-Driven Servers
Hi Luís,

yes, I'd recommend to update to the current snapshot. Granted, that's
actually my default suggestion, but I'm not even certain about the
date I implemented setenv at top-level, and the most recent snapshot
is where bugs get fixed (if found).

Thanks,

Marc


On 26 Mrz., 17:05, Luis Fernando Cordeiro <luis.lfcorde...@gmail.com>
wrote:

rss

unread,
May 4, 2012, 2:52:58 AM5/4/12
to event-driv...@googlegroups.com
Hello,

same problem here. tac_plus snapshot 201204260918, RHEL6 64bit, setenv IO_POLL_MECHANISM=8(before without this parameter in the config with the same behaviour) at the top of the tac_plus.cfg. I have two servers and the issue occurs randomly allmost every week, sometimes more offen, triggered by more aaa request at the same time. I collect some statistics during night and the time when it start is the time when usually CPU goes to 100% usage (sometimes only one CPU core and sometimes both). Duration of the hight CPU usage is again random (hours, days...).

russ

Dňa štvrtok, 22. marca 2012 21:02:06 UTC+1 Marc Huber napísal(-a):

Marc Huber

unread,
May 6, 2012, 9:56:05 AM5/6/12
to event-driv...@googlegroups.com
Hi,

I still can't reproduce this behaviour in my lab environment. There are various places where the main event loop is used. I'll need more information to localize this bug.

What logging type do you use (synchronous, asynchronous, syslog, pipe), and what MAVIS backends?

Cheers,

Marc

Aaron Turner

unread,
May 10, 2012, 12:29:22 PM5/10/12
to event-driv...@googlegroups.com
Hey Marc,

I'm seeing the same sort of thing as described here on our CentOS 5.4
x86_64 box.

Basically we'll see multiple tac_plus processes (usually 4-5, but
sometimes many more) running eating as much CPU as they can. The
problem seems to happen after tac_plus is restarted after the config
file is updated. We have some gear like Force10 which does stateful
cli auth, so I wonder if this is related to old active tacacs sessions
not being shutdown properly after the main tac_plus process is kill
-9'd.

Anyways, not sure if the problem I'm describing matches what other
people are seeing, but hopefully it helps.

Version: 201204101949/DES

I can send you a copy of our tac_plus.conf off list if you'd like.

Thanks,
Aaron
> --
> You received this message because you are subscribed to the Google Groups
> "Event-Driven Servers" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/event-driven-servers/-/3BPWI5jpMAQJ.
>
> To post to this group, send email to event-driv...@googlegroups.com.
> To unsubscribe from this group, send email to
> event-driven-ser...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/event-driven-servers?hl=en.



--
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

Marc Huber

unread,
May 10, 2012, 3:05:01 PM5/10/12
to event-driv...@googlegroups.com
Hi Aaron,


On Thursday, May 10, 2012 6:29:22 PM UTC+2, synfinatic wrote:
The problem seems to happen after tac_plus is restarted after the config
file is updated.  We have some gear like Force10 which does stateful
cli auth, so I wonder if this is related to old active tacacs sessions
not being shutdown properly after the main tac_plus process is kill
-9'd.

thanks, this gives a clue. I've just uploaded DEVEL.201205102052.tar.bz2, with the following patch applied:

--- tac_plus/main.c    2011/11/06 09:58:00    1.261
+++ tac_plus/main.c    2012/05/10 18:42:59
@@ -67,7 +67,7 @@
 cleanup_spawnd(struct context *ctx __attribute__ ((unused)),
            int cur __attribute__ ((unused)))
 {
-    if (common_data.users_cur == 0 && ctx_spawnd) {
+    if (ctx_spawnd) {
     ctx_spawnd = NULL;
     io_unregister(ctx->io, 0);
     close(0);

Could you (and everyone else who's seeing the high CPU load) please check if the problem still occurs?

Thanks,

Marc

rss

unread,
Jun 4, 2012, 6:04:46 AM6/4/12
to event-driv...@googlegroups.com
Hello,

unfotunately the issue is still there.
RHEL6, tac_plus.201205102052.

# top
top - 11:44:26 up 12 days,  2:24,  1 user,  load average: 1.09, 1.03, 0.94
Tasks: 120 total,   2 running, 118 sleeping,   0 stopped,   0 zombie
Cpu(s): 14.8%us, 35.3%sy,  0.0%ni, 49.9%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   1019768k total,   916896k used,   102872k free,    79488k buffers
Swap:  4871496k total,    69848k used,  4801648k free,   452840k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 3268 root      20   0 13268 1484  932 R 100.0  0.1  81:51.59 tac_plus
    1 root      20   0 19204  996  820 S  0.0  0.1   0:01.37 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.06 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:01.17 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0  49:36.83 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
#########################
# ps aux | grep tac
root      3264  0.0  0.0  13136   676 ?        Ss   Jun03   0:00 tac_plus: 0 connections, accepting up to 480 more
root      3266  0.0  0.1  13392  1548 ?        Ss   Jun03   0:00 tac_plus: 0 connections
root      3268  4.9  0.1  13268  1484 ?        Rs   Jun03  97:04 tac_plus: 0 connections
#########################
Config:

setenv IO_POLL_MECHANISM=8
id = spawnd {
        listen = { port = 49 }
}

id = tac_plus {
access log = /var/log/tac_plus.log
authorization log = /var/log/tac_plus_authorization.log
accounting log = /var/log/tac_acc.log

host.....
...
user = xxx {
   login = crypt ...
   enable 15 = crypt ...
   member = ...
   }
....
group.....
...

###########################
is it possible to provide some commands or steps... to generate some usable output to troubleshoote the issue?

russ

Aaron Turner

unread,
Jun 4, 2012, 12:14:11 PM6/4/12
to event-driv...@googlegroups.com
Could be a different issue... I haven't see the problem since applying
the patch. Not saying my problem is solved yet, but just I haven't
seen it *yet*. :)
> --
> You received this message because you are subscribed to the Google Groups
> "Event-Driven Servers" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/event-driven-servers/-/pnH9W2qgdeIJ.

Marc Huber

unread,
Jun 7, 2012, 7:50:01 AM6/7/12
to Event-Driven Servers
Hi Russ,

On 4 Jun., 12:04, rss <russl...@gmail.com> wrote:
> unfotunately the issue is still there.
...
> is it possible to provide some commands or steps... to generate some usable
> output to troubleshoote the issue?

nothing non-trivial. You'd need to compile with --debug and be
familiar with gdb.

Are there any log messages from PID 3268 that could be helpful?

Thanks,

Marc

Aaron Turner

unread,
Jun 25, 2012, 1:09:56 PM6/25/12
to event-driv...@googlegroups.com
FYI, I've seen this problem again on my server. :(

Recompiling with debug support.

Marc, if you have any suggestions/ideas what to do/look for next time
it happens let me know. I've used gdb more then a few times, but I'm
not familiar with the tac_plus code.
> --
> You received this message because you are subscribed to the Google Groups "Event-Driven Servers" group.

Marc Huber

unread,
Jul 2, 2012, 1:22:45 PM7/2/12
to event-driv...@googlegroups.com
On Monday, June 25, 2012 7:09:56 PM UTC+2, synfinatic wrote:
FYI, I've seen this problem again on my server. :(

Unfortunately, debugging this issue is non-trivial, as it apparently can't be reproduced at will.

Marc, if you have any suggestions/ideas what to do/look for next time
it happens let me know.  I've used gdb more then a few times, but I'm
not familiar with the tac_plus code.

It might be helpful to see what file descriptors the high-cpu process has in use, and what file descriptors the actual pollfd structs actually reference. Alas, obtaining that data depends is non-trivial, too.

I've changed some code yesterday. However, as I can't seem to reproduce the problem, I don't know if that will improve anything.

Thanks,

Marc

Rodrigo Pescador

unread,
Jul 26, 2012, 6:20:16 PM7/26/12
to event-driv...@googlegroups.com

Hello Marc,

I'm experiencing the problem right now. Have you already detected the problem?

Below some informations I've collected on my server. Is there anything else I could get to be useful?

Thanks!


root@saa-s1:~# ps aux | grep tac
root     11247  0.0  0.0  10488   696 ?        Ss   Jul23   0:05 tac_plus: 0 connections, accepting up to 900 more                            
root     19267  0.1  0.0  13608  1992 ?        Ss   Jul23   5:47 tac_plus: 0 connections                            
root     19268  0.0  0.5  49324 11460 ?        S    Jul23   0:39 /usr/bin/perl -w /usr/local/lib/mavis/mavis_tacplus_ldap.pl
root     20686 99.6  0.0  13116  1444 ?        Rs   17:11  90:41 tac_plus: 0 connections                            
root     20687  0.0  0.3  27796  7204 ?        S    17:11   0:00 /usr/bin/perl -w /usr/local/lib/mavis/mavis_tacplus_ldap.pl
root     20688 99.6  0.0  13320  1700 ?        Rs   17:11  90:42 tac_plus: 0 connections                            
root     20689  0.0  0.5  49192 11356 ?        S    17:11   0:00 /usr/bin/perl -w /usr/local/lib/mavis/mavis_tacplus_ldap.pl
root     22119  0.0  0.0   7548   852 pts/2    S+   18:42   0:00 grep tac


root@saa-s1:~# top

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                        
20686 root      20   0 13116 1444  872 R   99  0.1  92:49.65 tac_plus                                                                       
20688 root      20   0 13320 1700  908 R   99  0.1  92:50.58 tac_plus                                                                       
    1 root      20   0  8352  800  668 S    0  0.0   2:13.36 init                                                                           
    2 root      20   0     0    0    0 S    0  0.0   0:01.28 kthreadd                                                                       
    3 root      RT   0     0    0    0 S    0  0.0   0:03.54 migration/0




Marc Huber

unread,
Jul 28, 2012, 1:21:47 AM7/28/12
to event-driv...@googlegroups.com
Hi Rodrigo,


On Friday, July 27, 2012 12:20:16 AM UTC+2, Rodrigo Pescador wrote:
I'm experiencing the problem right now. Have you already detected the problem?

no ... can't reproduce it in my lab environment, so I can only make a guess -- the latest one is that there might be a (yet unfound) problem with log flushing. I've disabled the code that checks for logs being flushed in the latest snapshot.
 
Below some informations I've collected on my server. Is there anything else I could get to be useful?

  ls -l /proc/$(pidof tac_plus)/fd/

might be helpful.

Thanks,

Marc

Brett Barton

unread,
Aug 13, 2012, 12:31:16 PM8/13/12
to event-driv...@googlegroups.com
I was still experiencing the problem since my last update. Eventually it seems to go away but it takes quite a while.

That being said I have downloaded DEVEL.201207271858.tar.bz2 and installed it. Ill see how that goes.

Despite this problem, still LOVING tac_plus!

-Brett

Brett Barton

unread,
Aug 17, 2012, 1:50:15 PM8/17/12
to event-driv...@googlegroups.com
Just noticed today that tac_plus is again consuming 99.9% of the CPU time according to top. This is using DEVEL.201207271858.tar.bz2

-Brett

Marc Huber

unread,
Aug 18, 2012, 4:21:20 AM8/18/12
to event-driv...@googlegroups.com
Hi Brett (and anyone else that sees the high-CPU issue),


On Friday, August 17, 2012 7:50:15 PM UTC+2, Brett Barton wrote:
Just noticed today that tac_plus is again consuming 99.9% of the CPU time according to top. This is using DEVEL.201207271858.tar.bz2

if it happens again, I'd be grateful for
 - a gdb backtrace ("gdb -p <process id>", followed by "bt")
 - a few lines of strace (or truss, ktrace, ...) output ("strace -p <process id>")

Thanks,

Marc

Brett Barton

unread,
Aug 20, 2012, 11:02:07 AM8/20/12
to event-driv...@googlegroups.com
gdb backtrace:
(gdb) bt
#0  0xb77c8424 in __kernel_vsyscall ()
#1  0xb758f323 in read () from /lib/i686/cmov/libc.so.6
#2  0x0805f443 in tac_read ()
#3  0xb77b21d7 in io_poll () from /usr/local/sbin/../lib/libmavis.so
#4  0xb77b3c34 in io_main () from /usr/local/sbin/../lib/libmavis.so
#5  0x0805c932 in main ()

strace output

poll([{fd=0, events=POLLIN}, {fd=14, events=0}, {fd=13, events=0}, {fd=19, events=0}, {fd=11, events=POLLIN}, {fd=5, events=POLLIN}, {fd=9, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 10, 14560) = 1 ([{fd=11, revents=POLLIN|POLLERR|POLLHUP}])
gettimeofday({1345474850, 924073}, NULL) = 0
read(11, ""..., 128)                    = 0
gettimeofday({1345474850, 924260}, NULL) = 0
poll([{fd=0, events=POLLIN}, {fd=14, events=0}, {fd=13, events=0}, {fd=19, events=0}, {fd=11, events=POLLIN}, {fd=5, events=POLLIN}, {fd=9, events=POLLIN}, {fd=15, events=POLLIN}, {fd=12, events=POLLIN}, {fd=16, events=POLLIN}], 10, 14560) = 1 ([{fd=11, revents=POLLIN|POLLERR|POLLHUP}])
gettimeofday({1345474850, 924488}, NULL) = 0
^CProcess 7520 detached

Does that look like the output you are looking for? I am not terribly familiar with these tools.

Marc Huber

unread,
Aug 20, 2012, 11:32:41 AM8/20/12
to event-driv...@googlegroups.com
Hi Brett,


On Monday, August 20, 2012 5:02:07 PM UTC+2, Brett Barton wrote:
Does that look like the output you are looking for? I am not terribly familiar with these tools.

yes, that's exactly what I was hoping for. Upload of 201208201725 is in progress, diff to the previous version is:

--- packet.c    2012/06/30 13:34:11    1.127
+++ packet.c    2012/08/20 15:23:30
@@ -433,7 +433,7 @@
     ctx->in->length = TAC_PLUS_HDR_SIZE + data_len;
     len = read(cur, (u_char *) & ctx->in->hdr + ctx->in->offset,
            ctx->in->length - ctx->in->offset);
-    if (len < 0) {
+    if (len <= 0) {
     if (errno != EAGAIN)
         cleanup(ctx, cur);
     return;

The bug would be triggered if a client aborts the TCP session after sending the TACACS+ header, which kind-of explains why I've never seen it in my (well-behaving) environment.

Thanks a lot!

Marc

Brett Barton

unread,
Aug 20, 2012, 1:02:49 PM8/20/12
to event-driv...@googlegroups.com
Great news! I have compiled and installed it!

Aaron Turner

unread,
Aug 20, 2012, 1:18:06 PM8/20/12
to event-driv...@googlegroups.com
So have I! :)

/me crosses fingers, eyes and toes
> --
> You received this message because you are subscribed to the Google Groups
> "Event-Driven Servers" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/event-driven-servers/-/khq_JVzuVAwJ.

Marc Huber

unread,
Aug 21, 2012, 11:02:17 AM8/21/12
to event-driv...@googlegroups.com
Thinking more about yesterday's patch, I strongly believe that it would break fragmented packets. Revised version:

--- packet.c    2012/08/20 18:16:41    1.128
+++ packet.c    2012/08/21 14:56:32
@@ -394,6 +394,7 @@
     char msg[80];
     u_int data_len;
     int more_keys = 0;
+    int min_len = 1;
 
     ctx->last_io = io_now.tv_sec;
 
@@ -406,6 +407,7 @@
         return;
     }
     ctx->hdroff += len;
+    min_len = 0;
     }
     if (ctx->hdroff != TAC_PLUS_HDR_SIZE)
     return;
@@ -433,7 +435,7 @@

     ctx->in->length = TAC_PLUS_HDR_SIZE + data_len;
     len = read(cur, (u_char *) & ctx->in->hdr + ctx->in->offset,
            ctx->in->length - ctx->in->offset);
-    if (len < 0) {
+    if (len < min_len) {

     if (errno != EAGAIN)
         cleanup(ctx, cur);
     return;

Uploading in a couple of seconds ...

Marc

Aaron Turner

unread,
Aug 21, 2012, 1:20:50 PM8/21/12
to event-driv...@googlegroups.com
Thanks Marc. I'll let you know if this solves the problem.

Brett Barton

unread,
Aug 22, 2012, 8:30:38 AM8/22/12
to event-driv...@googlegroups.com
Thanks for all the thought, time, and effort you have put into this project. I downloaded the last revision this morning and installed it.

Thanks!
Reply all
Reply to author
Forward
0 new messages