HACMP 100% CPU usage no IP SWAP !?!? (2nd POST, URGENT HELP NEEDED)..

Aleksandar

Oct 22, 2003, 2:30:58 AM
Hello,

This is my second post since yesterday. I had a perfect HACMP cluster
with SSA disks; until yesterday everything was working pretty much OK
(AIX 4.3.3 + Informix 7.31). My clsmuxd HACMP process now shows 100%
of the CPU used, and I did not touch the HACMP configuration. What I
did was add SSA array disks (1 logical disk) to the shared VG, since
the new disks are a different size from the old ones; this worked
fine, and I created some dbspaces there. More interesting: since I had
a tenfold performance drop, I forced HACMP to stop, expecting the
machine to be seen on its boot address, BUT the adapter is still on
its service address ?!#@!?!. I have rebooted the source cluster
machine; still the same :-( .
cldare -t failed with no error, only this message:

cldare: An active Cluster Manager was detected elsewhere in the
cluster. This command must be run from a node with an active Cluster
Manager process in order for the Dynamic Reconfiguration to proceed.
The new configuration has been propogated to all nodes for your
convenience.

What has happened to my machines? Can ODM differences between them
cause such behaviour? Is it possible that the heartbeat cable has a
problem? Can I try cat < /dev/ttyX while the cluster is running?

Please give me any suggestions or pointers: where should I look, and
what should I do?

Thanks

Aleksandar

Iain

Oct 23, 2003, 4:54:29 AM
dam...@yahoo.com (Aleksandar) wrote in message news:<9040548b.03102...@posting.google.com>...


You don't say which version of HACMP you are running, but it probably
doesn't make much odds.

In my experience it is not unusual for HACMP to get itself confused
and mixed up. In this sort of situation, you really need an outage to
bring down all the nodes in the cluster, clean everything up (i.e.
make sure everything is on its boot address, VGs are varied off, apps
are stopped), then start it all up again and make sure it comes up
correctly.

Note : If you do a forced HACMP stop it just stops the daemons. It
does *not* do any config changes so it is correct that the adapter
stayed on the service address. The easiest way to put it on its boot
address is
chdev -l <adapter> -a state=up

Before restarting HACMP, you *must* clean everything up or it will not
start up correctly.

Iain.

Aleksandar

Oct 27, 2003, 9:54:15 AM
Hi,
I was digging around my AIX, and the HACMP configuration/verification
is clean. I am using standard AIX authentication on both nodes.
I was getting these errors:

mbcb: Exit Status = 0.
mbcb:
dsh: 5025-509 mbcb rsh had exit code 1

what is dsh?


0513-056 Timeout waiting for command response. If you specified a
foreign host, see the /etc/inittab file on the foreign host to verify
that the SRC daemon (srcmstr) was started with the -r flag to accept
remote requests.

/etc/inittab is pretty much OK, the same as on the other 3 HACMP
configurations I work on.

I can rsh between the machines with no problem at all and everything
works fine: the cluster comes up and the nodes join together. It is
just that clsmuxd still takes 100% of the CPU (1 out of 2 processors
is out of service for database processing).
Non-concurrent HACMP 4.4.0.0, AIX 4.3.3.

The only recent change that I have made to my machines is installing
the RPMs for the ssh server:

openssh-3.4p1-3
prngd-0.9.23-2
zlib-1.1.4-2
openssh-clients-3.4p1-3
openssl-0.9.6e-2
openssh-server-3.4p1-3

still without cluster

Thank you for the previous response.

Aleksandar

Iain

Oct 28, 2003, 8:07:57 AM
dam...@yahoo.com (Aleksandar) wrote in message news:<9040548b.03102...@posting.google.com>...

> mbcb: Exit Status = 0.
> mbcb:
> dsh: 5025-509 mbcb rsh had exit code 1
>
> what is dsh?
>

The only dsh I know of is distributed shell which comes with the PSSP
cluster management software and allows you to run multiple kerberised
rsh's (e.g. from a control workstation run a command on every node in
one go).

>
> 0513-056 Timeout waiting for command response. If you specified a
> foreign host, see the /etc/inittab file on the foreign host to verify
> that the SRC daemon (srcmstr) was started with the -r flag to accept
> remote requests.
>

Try doing a refresh -s inetd on all nodes in the cluster (rebooting
them will have the same effect).
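
The 0513-056 text itself points at the -r flag on the srcmstr entry in
/etc/inittab, so it may be worth checking that on each node as well.
A minimal sketch, assuming a standard colon-separated inittab; the
function name srcmstr_has_r_flag is made up for illustration, and
whether a missing -r is actually the cause here is not established:

```shell
#!/bin/sh
# Report whether the srcmstr entry in an inittab-style file carries -r.
# Usage: srcmstr_has_r_flag [path]   (path defaults to /etc/inittab)
# Prints "yes", "no", or "missing"; returns 0 only for "yes".
# NOTE: srcmstr_has_r_flag is a hypothetical helper, not an AIX command.
srcmstr_has_r_flag() {
    file="${1:-/etc/inittab}"
    # Take the first line whose identifier field is exactly "srcmstr".
    entry=$(grep '^srcmstr:' "$file" 2>/dev/null | head -n 1)
    if [ -z "$entry" ]; then
        echo "missing"
        return 2
    fi
    # Field 4 onward is the command; drop any trailing "# comment".
    cmd=$(printf '%s\n' "$entry" | cut -d: -f4- | sed 's/#.*//')
    # Pad with spaces so that -r only matches as a separate word.
    case " $cmd " in
        *" -r "*) echo "yes"; return 0 ;;
        *)        echo "no";  return 1 ;;
    esac
}
```

Running srcmstr_has_r_flag with no argument checks /etc/inittab on the
local node; pass a path to check a copy fetched from another node.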

Iain.

Dan Foster

Oct 28, 2003, 10:56:29 AM
In article <ac5718b7.03102...@posting.google.com>, Iain <ia...@cairnlin.co.uk> wrote:
> dam...@yahoo.com (Aleksandar) wrote in message news:<9040548b.03102...@posting.google.com>...
>
>> mbcb: Exit Status = 0.
>> mbcb:
>> dsh: 5025-509 mbcb rsh had exit code 1
>>
>> what is dsh?
>>
>
> The only dsh I know of is distributed shell which comes with the PSSP
> cluster management software and allows you to run multiple kerberised
> rsh's (e.g. from a control workstation run a command on every node in
> one go).

dsh is also now a standard part of AIX 5.2 via the csm.* filesets (csm =
cluster system management; one of the parts of PSSP that IBM extracted and
put into AIX 5.2).

It's indeed a very nice tool; we use it on our SP cluster extensively.

-Dan

Aleksandar

Nov 14, 2003, 9:49:33 AM
Hi,

Since I still have no solution (the clsmuxd HACMP process still shows
100% of the CPU used), this is my last try before reconfiguring or
reinstalling HACMP and maybe even AIX :-<, since I guess this might be
some OS-related problem (permissions, missing files, ...).

I have been searching a lot on Google, and from what I have seen I can
provide a few more things:
/var/tmp/snmpd.conf:

10/22/03 07:05:06 NOTICE: SMUX open: 2 risc6000sampleAgents.2 "DPI->SMUX daemon" (12/ 127.0.0.1+32772+2)
10/22/03 07:05:06 NOTICE: SMUX packet from (127.0.0.1+32772+2)
10/22/03 07:05:06 NOTICE: SMUX register: readWrite 1.3.6.1.4.1.2.2.1.1 in = -1 out = 0 (127.0.0.1+32772+2)
...
10/27/03 15:40:47 NOTICE: SMUX relation started with (127.0.0.1+33251+4)
10/27/03 15:45:04 NOTICE: SMUX packet from (127.0.0.1+33251+4)
10/27/03 15:45:04 EXCEPTIONS: unexpected operation: 2, missing identity (SMUX 127.0.0.1+33251+4)
10/29/03 18:06:06 NOTICE: SMUX relation started with (127.0.0.1+34905+5)
10/29/03 19:55:34 NOTICE: SMUX packet from (127.0.0.1+34905+5)
10/29/03 19:55:34 EXCEPTIONS: unexpected operation: 2, missing identity (SMUX 127.0.0.1+34905+5)

Could this be related somehow?
How can I tell the clsmuxd daemon not to use SNMP (if that is possible)?
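
One way to judge whether the SMUX exceptions line up with the start of
the CPU problem is to count them per day. This is only a sketch: it
assumes every log entry sits on one line and starts with
"MM/DD/YY HH:MM:SS LEVEL:" as in the excerpt above, and the function
name smux_exceptions_per_day is invented for illustration.

```shell
#!/bin/sh
# Count "EXCEPTIONS:" entries per day in an snmpd log.
# Usage: smux_exceptions_per_day <logfile>
# ASSUMPTION: field 1 is the date and field 3 is the level, e.g.
#   10/27/03 15:45:04 EXCEPTIONS: unexpected operation: 2, ...
# NOTE: smux_exceptions_per_day is a hypothetical helper name.
smux_exceptions_per_day() {
    awk '$3 == "EXCEPTIONS:" { count[$1]++ }
         END { for (day in count) print day, count[day] }' "$1" | sort
}
```

Each output line is a date followed by the number of exceptions logged
on that date; days without exceptions are not listed.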

Cluster verification runs fine, with no errors.

/etc/inetd.conf:

ftp stream tcp6 nowait root /usr/sbin/ftpd ftpd
telnet stream tcp6 nowait root /usr/sbin/telnetd telnetd -a
shell stream tcp6 nowait root /usr/sbin/rshd rshd
kshell stream tcp nowait root /usr/sbin/krshd krshd
login stream tcp6 nowait root /usr/sbin/rlogind rlogind
klogin stream tcp nowait root /usr/sbin/krlogind krlogind
exec stream tcp6 nowait root /usr/sbin/rexecd rexecd
##comsat dgram udp wait root /usr/sbin/comsat comsat
##uucp stream tcp nowait root /usr/sbin/uucpd uucpd
##bootps dgram udp wait root /usr/sbin/bootpd bootpd /etc/bootptab
##
## Finger, systat and netstat give out user information which may be
## valuable to potential "system crackers." Many sites choose to disable
## some or all of these services to improve security.
##
##finger stream tcp nowait nobody /usr/sbin/fingerd fingerd
##systat stream tcp nowait nobody /usr/bin/ps ps -ef
##netstat stream tcp nowait nobody /usr/bin/netstat netstat -f inet
#
##tftp dgram udp6 SRC nobody /usr/sbin/tftpd tftpd -n
##talk dgram udp wait root /usr/sbin/talkd talkd
ntalk dgram udp wait root /usr/sbin/talkd talkd
#
# rexd uses very minimal authentication and many sites choose to disable
# this service to improve security.
#
##rquotad sunrpc_udp udp wait root /usr/sbin/rpc.rquotad rquotad 100011 1
##rexd sunrpc_tcp tcp wait root /usr/sbin/rpc.rexd rexd 100017 1
rstatd sunrpc_udp udp wait root /usr/sbin/rpc.rstatd rstatd 100001 1-3
rusersd sunrpc_udp udp wait root /usr/lib/netsvc/rusers/rpc.rusersd rusersd 100002 1-2
rwalld sunrpc_udp udp wait root /usr/lib/netsvc/rwall/rpc.rwalld rwalld 100008 1
sprayd sunrpc_udp udp wait root /usr/lib/netsvc/spray/rpc.sprayd sprayd 100012 1
pcnfsd sunrpc_udp udp wait root /usr/sbin/rpc.pcnfsd pcnfsd 150001 1-2
echo stream tcp nowait root internal
discard stream tcp nowait root internal
chargen stream tcp nowait root internal
daytime stream tcp nowait root internal
time stream tcp nowait root internal
echo dgram udp wait root internal
discard dgram udp wait root internal
chargen dgram udp wait root internal
daytime dgram udp wait root internal
time dgram udp wait root internal
## The following line is for installing over the network.
##instsrv stream tcp nowait netinst /u/netinst/bin/instsrv instsrv -r /tmp/netinstalllog /u/netinst/scripts
ttdbserver sunrpc_tcp tcp wait root /usr/dt/bin/rpc.ttdbserver rpc.ttdbserver 100083 1
dtspc stream tcp nowait root /usr/dt/bin/dtspcd /usr/dt/bin/dtspcd
cmsd sunrpc_udp udp wait root /usr/dt/bin/rpc.cmsd cmsd 100068 2-5
##imap2 stream tcp nowait root /usr/sbin/imapd imapd
##pop3 stream tcp nowait root /usr/sbin/pop3d pop3d
godm stream tcp nowait root /usr/sbin/cluster/godmd


/etc/inittab:

init:2:initdefault:
brc::sysinit:/sbin/rc.boot 3 >/dev/console 2>&1 # Phase 3 of system boot
powerfail::powerfail:/etc/rc.powerfail 2>&1 | alog -tboot > /dev/console # Power Failure Detection
load64bit:2:wait:/etc/methods/cfg64 >/dev/console 2>&1 # Enable 64-bit execs
rc:2:wait:/etc/rc 2>&1 | alog -tboot > /dev/console # Multi-User checks
fbcheck:2:wait:/usr/sbin/fbcheck 2>&1 | alog -tboot > /dev/console # run /etc/firstboot
srcmstr:2:respawn:/usr/sbin/srcmstr # System Resource Controller
prng:2:wait:/usr/bin/startsrc -sprngd
harc:2:wait:/usr//sbin/cluster/etc/harc.net # HACMP for AIX network startup
rctcpip:a:wait:/etc/rc.tcpip > /dev/console 2>&1 # Start TCP/IP daemons
rcnfs:a:wait:/etc/rc.nfs > /dev/console 2>&1 # Start NFS Daemons
cron:2:respawn:/usr/sbin/cron
piobe:2:wait:/usr/lib/lpd/pio/etc/pioinit >/dev/null 2>&1 # pb cleanup
qdaemon:a:wait:/usr/bin/startsrc -sqdaemon
writesrv:a:wait:/usr/bin/startsrc -swritesrv
uprintfd:2:respawn:/usr/sbin/uprintfd
l2:2:wait:/etc/rc.d/rc 2
l3:3:wait:/etc/rc.d/rc 3
l4:4:wait:/etc/rc.d/rc 4
l5:5:wait:/etc/rc.d/rc 5
l6:6:wait:/etc/rc.d/rc 6
l7:7:wait:/etc/rc.d/rc 7
l8:8:wait:/etc/rc.d/rc 8
l9:9:wait:/etc/rc.d/rc 9
diagd:2:once:/usr/lpp/diagnostics/bin/diagd >/dev/console 2>&1
pmd:2:wait:/usr/bin/pmd > /dev/console 2>&1 # Start PM daemon
logsymp:2:once:/usr/lib/ras/logsymptom # for system dumps
httpdlite:2:once:/usr/IMNSearch/httpdlite/httpdlite -r /etc/IMNSearch/httpdlite/httpdlite.conf & >/dev/console 2>&1
cons:0123456789:respawn:/usr/sbin/getty /dev/console
clinit:a:wait:/bin/touch /usr//sbin/cluster/.telinit # HACMP for AIX These must be the last entries in inittab!
pst_clinit:a:wait:/bin/echo Created /usr//sbin/cluster/.telinit > /dev/console # HACMP for AIX These must be the last entries in inittab!
tty1:2:off:/usr/sbin/getty /dev/tty1
APCdaemon:2:wait:/etc/rc.APCupsd start #POWERCHUTE
tty2:2:off:/usr/sbin/getty /dev/tty2
hacmp6000:2:wait:/usr//sbin/cluster/etc/rc.cluster -boot -l -i -b # Bring up Cluster

Please tell me if anything else that I have not posted would help.

How can I see (verbosely) what the process is actually doing (calls,
open files, searches, requests, kernel extensions, ...)?


Aleksandar

Jose Pina Coelho

Nov 29, 2003, 2:16:33 PM
dam...@yahoo.com (Aleksandar) wrote in news:9040548b.0311140649.78d35f90@posting.google.com:

> Hi,
>
> since I am still with no solution(clsmuxd HACMP process still shows
> 100%
> of the CPU used) this is my last try before reconfiguring/reinstalling
> HACMP and maybe even AIX :-<, since I guess this might be some OS
> related(permissions,files missing....) problem.
>
> I have been sersching a lot on google and from what I have seen I can
> provide few more things:
> /var/tmp/snmpd.conf:
>
> may it be related somehow?
> how to tell the clsmuxd deamon not to use snmp (if possible)?
>
> cluster verification is going fine no errors.

If this is AIX 5.2, the default snmpd is v3; HACMP uses v1.

Change it with:
stopsrc -s snmpd
snmpv3_ssw -1
startsrc -s snmpd
startsrc -s dpid2
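
To see which agent is in effect before and after the switch, it may
help to look at where /usr/sbin/snmpd points. This sketch rests on an
assumption: that on AIX 5.2 /usr/sbin/snmpd is a symbolic link to the
agent binary in use (snmpdv1 for v1, snmpdv3ne or snmpdv3e for v3).
Verify that on your own system; link_target is just an illustrative
helper name.

```shell
#!/bin/sh
# Print the target of a symbolic link using only ls and sed.
# Usage: link_target <path>
# Prints nothing and returns 1 if <path> is not a symbolic link.
# ASSUMPTION: on AIX 5.2, /usr/sbin/snmpd is a symlink whose target
# name (snmpdv1, snmpdv3ne, snmpdv3e) shows the active SNMP agent.
# NOTE: link_target is a hypothetical helper, not an AIX command.
link_target() {
    [ -L "$1" ] || return 1
    # ls -l prints "... name -> target"; keep only the target part.
    ls -l "$1" | sed 's/.* -> //'
}
```

For example, link_target /usr/sbin/snmpd prints the name of the binary
the link resolves to, or nothing if snmpd is a regular file.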


--
Doing AIX support was the most monty-pythonesque
activity available at the time.
Eagerly awaiting my thin chocolat mint.
