while the subject line is different, this is an update to my previous post on Dec. 18th. I've now upgraded my cluster to the latest Rocks 5.4.3 but still seem to have the same problem. Namely the torque pbs_server will periodically mark the compute node (this is only a 2-node cluster) as being down even though it seems to be pingable. The output in the pbs server log file always says something like:
01/04/2012 15:33:54;0004;PBS_Server;Svr;check_nodes;node compute-0-0 not detected in 177 seconds, marking node down
The offending compute node is a 4-way Opteron 6274 (4x16-core) box (Mercury RM206 server from Advanced HPC). The curious thing is how sporadic it is. While typing this email I just noticed that pbsnodes now reports the compute node as free.
Any suggestions for things to look into?
thanks in advance,
Kirk Peterson
Like you say, eventually the job finishes and the node starts answering pings again, and get marked as "free" (not down). So the problem fixes itself. Is it only annoying to you, or does it affect other things (e.g. the next job doesn't start until the node is marked as "free", which takes 300 seconds)?
What sort of jobs are running when this happens? You might look at the logs in /opt/torque/*/logs when it's marked as down and see what is running, and who is running it. It may be the user is trying to awk lines from a 17 GB text file and pipe that to grep. Don't laugh, I've had users try that, and wonder what's wrong.
Bart Brashers
________________________________
This message contains information that may be confidential, privileged or otherwise protected by law from disclosure. It is intended for the exclusive use of the Addressee(s). Unless you are the addressee or authorized agent of the addressee, you may not review, copy, distribute or disclose to anyone the message or any information contained within. If you have received this message in error, please contact the sender by electronic reply to em...@environcorp.com and immediately delete all copies of the message.
-Kirk
>
> Actually the strange thing is that this happens when absolutely nothing is
> running (at least any user jobs) on the node. It will be up marked as up
> and then after a while the frontend won't hear from it and then it gets
> marked as down. It seems to be independent of the load on the frontend as
> well (which is generally very lightly used). There are only 2 users on
> this box.
>
> -Kirk
Could it be a switch issue? This may seem a little overkill, but can you
put the front-end NIC card in promiscuous mode and sniff for packets to see
if they are getting through?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20120104/34cc1f9d/attachment.html
> Umm, not sure how to do this. Today I did bypass the GigE switch and just
> went directly from the compute node to
> the frontend with a crossover cable. The problem however persisted.
>
>
Just shooting in the dark here with things I would check...
1) Is pbs_mom running on the compute nodes?
2) Do the pbs_mom logs on the compute nodes show any errors?
3) From the head node, can you telnet to the compute nodes on port 15002
and 15003, and can you telnet from the compute node to the head node on
port 15001?
If #3 is successful, try restarting pbs_mom services on the compute nodes
and see if pbsnodes (or qnodes) reports the nodes as up.
-------------- next part --------------
An HTML attachment was scrubbed...
pbs_mom is always running on the compute node
pbs_mom logs show no errors on the compute node
I can make a connection with telnet on the ports below, but the connections are always closed:
telnet c0-0 15002
Trying 10.1.255.254...
Connected to compute-0-0.local (10.1.255.254).
Escape character is '^]'.
+2+15+15056+0+72+56Bad DIS based Request Protocol MSG=cannot decode messageConnection closed by foreign host.
telnet c0-0 15003
Trying 10.1.255.254...
Connected to compute-0-0.local (10.1.255.254).
Escape character is '^]'.
Connection closed by foreign host.
101 compute-0-0.local> telnet mpc.local 15001
Trying 10.1.1.1...
Connected to mpc.local (10.1.1.1).
Escape character is '^]'.
+2+15+15056+0+72+56Bad DIS based Request Protocol MSG=cannot decode messageConnection closed by foreign host.
I have a feeling this is normal behavior for telnet though since this service is usually not allowed.
Is there way to determine if perhaps the ethernet NIC on this new compute node is maybe a bit flaky?
-Kirk
It sounds like your nodes can talk to each other so its not a communication
issue.. You can monitor the NIC on the new node by putting the NIC in
promiscuous mode and watch the packets coming over the wire using a packet
sniffer, but I'm not convinced that's the issue. If you have a second NIC
card in the compute node, try using the second part or even a different
port. I'm at a loss right now as to what to try next.
-------------- next part --------------
An HTML attachment was scrubbed...
# momctl -d 3 -h compute-0-0
Host: compute-0-0.local/compute-0-0.local Version: 2.4.11 PID: 12168
Server[0]: mpc.local (10.1.1.1:15001)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
Last Msg From Server: 63626 seconds (DeleteJob)
Last Msg To Server: -115 seconds
I've never seen the negative time on the "last message to server" line in my other clusters. I've seen as large as -459 seconds shown from this
node, it slowly marches towards positive values, and then after a while it comes up negative again. I think the node is supposed
to send a ping to the server every 45 seconds, but perhaps the negative times are messing this up?
-Kirk
# date ; \ssh -x compute-0-0 date
Bart
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
> discussio...@sdsc.edu] On Behalf Of Kirk Peterson
# date ; ssh -x compute-0-0 date
Thu Jan 5 11:04:58 PST 2012
Thu Jan 5 11:08:59 PST 2012
-Kirk
# grep 'compute-0-0.local ntpd' /var/log/daemon
Dec 6 17:22:39 compute-0-0.local ntpd[6402]: kernel time sync status 0040
Dec 6 17:27:01 compute-0-0.local ntpd[6402]: synchronized to 192.168.80.1, stratum 3
Dec 6 17:37:45 compute-0-0.local ntpd[6402]: kernel time sync enabled 0001
Dec 6 19:09:24 compute-0-0.local ntpdate[6461]: step time server 192.168.80.1 offset 0.484697 sec
Dec 6 19:09:24 compute-0-0.local ntpd[6465]: ntpd 4.2...@1.1570-o Sat Dec 19 00:56:13 UTC 2009 (1)
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: precision = 1.000 usec
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: Listening on interface lo, 127.0.0.1#123 Enabled
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: Listening on interface eth0, 192.168.255.254#123 Enabled
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: kernel time sync status 0040
Dec 6 19:13:46 compute-0-0.local ntpd[6466]: synchronized to 192.168.80.1, stratum 3
Dec 6 19:34:18 compute-0-0.local ntpd[6466]: kernel time sync enabled 0001
If you don't see stuff like that, then ntpd is not working on the compute node, which would explain the clock drift. Check if ntpd is running on the compute node and on the FE:
$ cat /usr/local/bin/psgrep
#!/bin/tcsh -f
ps aux | head -1
ps aux | grep -i $* | grep -v /usr/local/bin/psgrep | grep -v "grep -i $*"
$ psgrep ntpd
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ntp 5502 0.0 0.0 23416 5048 ? SLs 2011 0:04 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid
$ ssh compute-0-0 'psgrep ntpd'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ntp 6466 0.0 0.0 19212 4904 ? SLs 2011 0:00 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid
If they are not both running, then look in the logs for clues why they stopped.
Rocks standard set-up for ntpd should look like this:
$ ssh compute-0-0 'cat /etc/ntp.conf'
server 192.168.80.1
driftfile /var/lib/ntp/drift
Bart Brashers
Jan 5 11:00:33 compute-0-0.local ntpd[23501]: ntpd 4.2...@1.1570-o Sat Dec 19 00:56:13 UTC 2009 (1)
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: precision = 1.000 usec
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: Listening on interface lo, 127.0.0.1#123 Enabled
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: Listening on interface eth1, 10.1.255.254#123 Enabled
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: kernel time sync status 0040
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: frequency initialized 0.000 PPM from /var/lib/ntp/drift
Jan 5 11:00:47 compute-0-0.local ntpd[23502]: ntpd exiting on signal 15
Jan 5 11:00:48 compute-0-0.local ntpdate[23532]: step time server 10.1.1.1 offset -14.217548 sec
ntpd is running on both the frontend and compute-0-0. Any idea why ntpd exits on signal 15?
/etc/ntp.conf on compute-0-0 looks like:
server 10.1.1.1
driftfile /var/lib/ntp/drift
-Kirk
Try this:
[root@wclyn-linux1 ~]# tcsh
[root ~]# set now = `date` ; ssh compute-0-0 date -s \"$now\"
That will set the time on the compute node to something pretty close to the same as on the FE. Then watch /var/log/daemon as the afternoon progresses, and see if it stays alive instead if dying with signal 15.
Hopefully, this will solve all your problems, and we can all signal 15 happily. Pun fully intended.
Bart
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
> discussio...@sdsc.edu] On Behalf Of Kirk Peterson
> Sent: Thursday, January 05, 2012 12:23 PM
> To: Discussion of Rocks Clusters
> Subject: Re: [Rocks-Discuss] torque marks compute node as down
>
>
-Kirk
PS - I guess this has moved out of Rocks territory, but perhaps there is interest if anyone comes across one of the new AMD 6274 servers.
Thanks to all for your help - especially Bart for pointing me in the right direction.
-Kirk