[Rocks-Discuss] torque marks compute node as down

Kirk Peterson

unread,

Jan 4, 2012, 7:19:04 PM1/4/12

to Rocks-Discuss Rocks-Discuss List

Hi,

while the subject line is different, this is an update to my previous post on Dec. 18th. I've now upgraded my cluster to the latest Rocks 5.4.3 but still seem to have the same problem. Namely the torque pbs_server will periodically mark the compute node (this is only a 2-node cluster) as being down even though it seems to be pingable. The output in the pbs server log file always says something like:

01/04/2012 15:33:54;0004;PBS_Server;Svr;check_nodes;node compute-0-0 not detected in 177 seconds, marking node down

The offending compute node is a 4-way Opteron 6274 (4x16-core) box (Mercury RM206 server from Advanced HPC). The curious thing is how sporadic it is. While typing this email I just noticed that pbsnodes now reports the compute node as free.

Any suggestions for things to look into?

thanks in advance,

Kirk Peterson

Bart Brashers

unread,

Jan 4, 2012, 7:49:59 PM1/4/12

to Discussion of Rocks Clusters

Is the node under heavy load while this happens? If either the node or the FE is under very heavy load, then "pings" can get dropped. If enough of them drop, then the node gets marked as down. Perhaps the job(s) is writing so much output to the FE that it's saturating the interface, so "pings" are being dropped. And by "pings" I don't mean /bin/ping, but the call and response that pbs_server (on FE) and pbs_mom (on compute) do.

Like you say, eventually the job finishes and the node starts answering pings again, and get marked as "free" (not down). So the problem fixes itself. Is it only annoying to you, or does it affect other things (e.g. the next job doesn't start until the node is marked as "free", which takes 300 seconds)?

What sort of jobs are running when this happens? You might look at the logs in /opt/torque/*/logs when it's marked as down and see what is running, and who is running it. It may be the user is trying to awk lines from a 17 GB text file and pipe that to grep. Don't laugh, I've had users try that, and wonder what's wrong.

Bart Brashers

________________________________
This message contains information that may be confidential, privileged or otherwise protected by law from disclosure. It is intended for the exclusive use of the Addressee(s). Unless you are the addressee or authorized agent of the addressee, you may not review, copy, distribute or disclose to anyone the message or any information contained within. If you have received this message in error, please contact the sender by electronic reply to em...@environcorp.com and immediately delete all copies of the message.

Kirk Peterson

unread,

Jan 4, 2012, 10:21:13 PM1/4/12

to Discussion of Rocks Clusters

Actually the strange thing is that this happens when absolutely nothing is running (at least any user jobs) on the node. It will be up marked as up and then after a while the frontend won't hear from it and then it gets marked as down. It seems to be independent of the load on the frontend as well (which is generally very lightly used). There are only 2 users on this box.

-Kirk

Ryan Golhar

unread,

Jan 4, 2012, 10:47:25 PM1/4/12

to Discussion of Rocks Clusters

On Wed, Jan 4, 2012 at 10:21 PM, Kirk Peterson <kipe...@wsu.edu> wrote:

>
> Actually the strange thing is that this happens when absolutely nothing is
> running (at least any user jobs) on the node. It will be up marked as up
> and then after a while the frontend won't hear from it and then it gets
> marked as down. It seems to be independent of the load on the frontend as
> well (which is generally very lightly used). There are only 2 users on
> this box.
>
> -Kirk

Could it be a switch issue? This may seem a little overkill, but can you
put the front-end NIC card in promiscuous mode and sniff for packets to see
if they are getting through?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20120104/34cc1f9d/attachment.html

Kirk Peterson

unread,

Jan 5, 2012, 12:44:54 AM1/5/12

to Discussion of Rocks Clusters

Umm, not sure how to do this. Today I did bypass the GigE switch and just went directly from the compute node to
the frontend with a crossover cable. The problem however persisted.

Ryan Golhar

unread,

Jan 5, 2012, 1:02:24 AM1/5/12

to Discussion of Rocks Clusters

On Thu, Jan 5, 2012 at 12:44 AM, Kirk Peterson <kipe...@wsu.edu> wrote:

> Umm, not sure how to do this. Today I did bypass the GigE switch and just
> went directly from the compute node to
> the frontend with a crossover cable. The problem however persisted.
>
>

Just shooting in the dark here with things I would check...

1) Is pbs_mom running on the compute nodes?
2) Do the pbs_mom logs on the compute nodes show any errors?
3) From the head node, can you telnet to the compute nodes on port 15002
and 15003, and can you telnet from the compute node to the head node on
port 15001?

If #3 is successful, try restarting pbs_mom services on the compute nodes
and see if pbsnodes (or qnodes) reports the nodes as up.

-------------- next part --------------
An HTML attachment was scrubbed...

URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20120105/cb019d44/attachment.html

Kirk Peterson

unread,

Jan 5, 2012, 1:22:42 AM1/5/12

to Discussion of Rocks Clusters

Ryan,

pbs_mom is always running on the compute node
pbs_mom logs show no errors on the compute node
I can make a connection with telnet on the ports below, but the connections are always closed:

telnet c0-0 15002
Trying 10.1.255.254...
Connected to compute-0-0.local (10.1.255.254).
Escape character is '^]'.

+2+15+15056+0+72+56Bad DIS based Request Protocol MSG=cannot decode messageConnection closed by foreign host.

telnet c0-0 15003
Trying 10.1.255.254...
Connected to compute-0-0.local (10.1.255.254).
Escape character is '^]'.

Connection closed by foreign host.

101 compute-0-0.local> telnet mpc.local 15001
Trying 10.1.1.1...
Connected to mpc.local (10.1.1.1).
Escape character is '^]'.

+2+15+15056+0+72+56Bad DIS based Request Protocol MSG=cannot decode messageConnection closed by foreign host.

I have a feeling this is normal behavior for telnet though since this service is usually not allowed.

Is there way to determine if perhaps the ethernet NIC on this new compute node is maybe a bit flaky?

-Kirk

Ryan Golhar

unread,

Jan 5, 2012, 11:26:47 AM1/5/12

to Discussion of Rocks Clusters

It sounds like your nodes can talk to each other so its not a communication
issue.. You can monitor the NIC on the new node by putting the NIC in
promiscuous mode and watch the packets coming over the wire using a packet
sniffer, but I'm not convinced that's the issue. If you have a second NIC
card in the compute node, try using the second part or even a different
port. I'm at a loss right now as to what to try next.

-------------- next part --------------
An HTML attachment was scrubbed...

URL: https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/attachments/20120105/50c4679c/attachment.html

Kirk Peterson

unread,

Jan 5, 2012, 12:03:29 PM1/5/12

to Discussion of Rocks Clusters

One clue I should mention again is the strange output of momctl :

# momctl -d 3 -h compute-0-0

Host: compute-0-0.local/compute-0-0.local Version: 2.4.11 PID: 12168
Server[0]: mpc.local (10.1.1.1:15001)
Init Msgs Received: 0 hellos/1 cluster-addrs
Init Msgs Sent: 1 hellos
Last Msg From Server: 63626 seconds (DeleteJob)
Last Msg To Server: -115 seconds

I've never seen the negative time on the "last message to server" line in my other clusters. I've seen as large as -459 seconds shown from this
node, it slowly marches towards positive values, and then after a while it comes up negative again. I think the node is supposed
to send a ping to the server every 45 seconds, but perhaps the negative times are messing this up?

-Kirk

Bart Brashers

unread,

Jan 5, 2012, 1:13:53 PM1/5/12

to Discussion of Rocks Clusters

Is there a clock skew between your nodes and the frontend? What's the output of

# date ; \ssh -x compute-0-0 date

Bart

> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
> discussio...@sdsc.edu] On Behalf Of Kirk Peterson

Kirk Peterson

unread,

Jan 5, 2012, 2:06:04 PM1/5/12

to Discussion of Rocks Clusters

Yes, big time (no pun intended):

# date ; ssh -x compute-0-0 date

Thu Jan 5 11:04:58 PST 2012
Thu Jan 5 11:08:59 PST 2012

-Kirk

Bart Brashers

unread,

Jan 5, 2012, 3:01:59 PM1/5/12

to Discussion of Rocks Clusters

Take a look in var/log/daemon, search for 'compute-0-0.local ntpd', and see what it says. It should say it's listening on interface eth0, and synchronizing to the IP of your FE's eth0 (10.0.0.1 by default). Something like this example:

# grep 'compute-0-0.local ntpd' /var/log/daemon
Dec 6 17:22:39 compute-0-0.local ntpd[6402]: kernel time sync status 0040
Dec 6 17:27:01 compute-0-0.local ntpd[6402]: synchronized to 192.168.80.1, stratum 3
Dec 6 17:37:45 compute-0-0.local ntpd[6402]: kernel time sync enabled 0001
Dec 6 19:09:24 compute-0-0.local ntpdate[6461]: step time server 192.168.80.1 offset 0.484697 sec
Dec 6 19:09:24 compute-0-0.local ntpd[6465]: ntpd 4.2...@1.1570-o Sat Dec 19 00:56:13 UTC 2009 (1)
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: precision = 1.000 usec
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: Listening on interface lo, 127.0.0.1#123 Enabled
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: Listening on interface eth0, 192.168.255.254#123 Enabled
Dec 6 19:09:24 compute-0-0.local ntpd[6466]: kernel time sync status 0040
Dec 6 19:13:46 compute-0-0.local ntpd[6466]: synchronized to 192.168.80.1, stratum 3
Dec 6 19:34:18 compute-0-0.local ntpd[6466]: kernel time sync enabled 0001

If you don't see stuff like that, then ntpd is not working on the compute node, which would explain the clock drift. Check if ntpd is running on the compute node and on the FE:

$ cat /usr/local/bin/psgrep
#!/bin/tcsh -f
ps aux | head -1
ps aux | grep -i $* | grep -v /usr/local/bin/psgrep | grep -v "grep -i $*"

$ psgrep ntpd
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ntp 5502 0.0 0.0 23416 5048 ? SLs 2011 0:04 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid

$ ssh compute-0-0 'psgrep ntpd'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ntp 6466 0.0 0.0 19212 4904 ? SLs 2011 0:00 ntpd -A -u ntp:ntp -p /var/run/ntpd.pid

If they are not both running, then look in the logs for clues why they stopped.

Rocks standard set-up for ntpd should look like this:

$ ssh compute-0-0 'cat /etc/ntp.conf'
server 192.168.80.1
driftfile /var/lib/ntp/drift

Bart Brashers

Kirk Peterson

unread,

Jan 5, 2012, 3:22:54 PM1/5/12

to Discussion of Rocks Clusters

I've been trying to look into this ever since you got me looking at the time skew issue. The /var/log/daemon file has entries like:

Jan 5 11:00:33 compute-0-0.local ntpd[23501]: ntpd 4.2...@1.1570-o Sat Dec 19 00:56:13 UTC 2009 (1)
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: precision = 1.000 usec
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: Listening on interface lo, 127.0.0.1#123 Enabled
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: Listening on interface eth1, 10.1.255.254#123 Enabled
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: kernel time sync status 0040
Jan 5 11:00:33 compute-0-0.local ntpd[23502]: frequency initialized 0.000 PPM from /var/lib/ntp/drift
Jan 5 11:00:47 compute-0-0.local ntpd[23502]: ntpd exiting on signal 15
Jan 5 11:00:48 compute-0-0.local ntpdate[23532]: step time server 10.1.1.1 offset -14.217548 sec

ntpd is running on both the frontend and compute-0-0. Any idea why ntpd exits on signal 15?

/etc/ntp.conf on compute-0-0 looks like:

server 10.1.1.1
driftfile /var/lib/ntp/drift

-Kirk

Bart Brashers

unread,

Jan 5, 2012, 4:01:35 PM1/5/12

to Discussion of Rocks Clusters

Could be that the time difference is too large. I think ntpd will cowardly refuse to change the time if the time difference is too great.

Try this:

[root@wclyn-linux1 ~]# tcsh
[root ~]# set now = `date` ; ssh compute-0-0 date -s \"$now\"

That will set the time on the compute node to something pretty close to the same as on the FE. Then watch /var/log/daemon as the afternoon progresses, and see if it stays alive instead if dying with signal 15.

Hopefully, this will solve all your problems, and we can all signal 15 happily. Pun fully intended.

Bart

> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
> discussio...@sdsc.edu] On Behalf Of Kirk Peterson
> Sent: Thursday, January 05, 2012 12:23 PM
> To: Discussion of Rocks Clusters
> Subject: Re: [Rocks-Discuss] torque marks compute node as down
>
>

Kirk Peterson

unread,

Jan 5, 2012, 6:47:44 PM1/5/12

to Discussion of Rocks Clusters

Nope, no joy. Note that after ntpd exits with a signal 15, the system calls ntpdate, which syncs with the frontend. The compute node then very quickly slews out of sync - in a 25 minute period it moves forward 25 min - so it's somehow in double time. Could this be some sort of hardware issue?

-Kirk

PS - I guess this has moved out of Rocks territory, but perhaps there is interest if anyone comes across one of the new AMD 6274 servers.

Kirk Peterson

unread,

Jan 5, 2012, 7:42:39 PM1/5/12

to Discussion of Rocks Clusters

Fixed. After some significant googling, I found that others had a similar problem with the clock running a factor of 2 too fast. The solution is to give the noapic kernel option at boot. Seems to have fixed the problem.

Thanks to all for your help - especially Bart for pointing me in the right direction.

-Kirk

Bart Brashers

unread,

Jan 5, 2012, 7:45:21 PM1/5/12

to Discussion of Rocks Clusters

Yes, quite possibly a hardware failure.

Reply all

Reply to author

Forward