I have installed Rocks Cluster 5.1 on a quad-core based cluster. For some reason the compute nodes are not able to ping an external server, even though I have stopped the firewall on the front end and on each node.
The front end can ping the external server and vice versa, but the compute nodes cannot. Where is this coming from? Any help?
Thx
Don't forget, the nodes are on a separate network from the external network.
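If the frontend is NATing for that private network (the normal Rocks setup), a quick sanity check on the FE would be something like this (just a sketch, assuming stock iptables; interface names and subnets may differ on your machine):

  # IP forwarding must be on for the FE to route for the nodes
  cat /proc/sys/net/ipv4/ip_forward    # should print 1
  # and there should be a MASQUERADE (or SNAT) rule covering the private subnet
  iptables -t nat -L POSTROUTING -n -v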
------------------
Michael Duncan
Systems Analyst
mdu...@x-iss.com
eXcellence in IS Solutions, Inc. (X-ISS)
Office: 713.862.9200 x215
http://www.x-iss.com
Making IT Work for You
HPC & Enterprise IT Solutions
Hi,
What should I do to make it work?
Thx
> Date: Mon, 23 Mar 2009 09:53:50 -0500
> From: MDu...@x-iss.com
> To: npaci-rocks...@sdsc.edu
> Subject: Re: [Rocks-Discuss] Compute node can't ping external server
Second, can you give me the output of the following (from the FE):
[root@grid ~]# dbreport static-routes compute-0-0
#
# Do NOT Edit (generated by dbreport)
#
# Global routes
any net 224.0.0.0/4 dev eth0
any host 255.255.255.255 dev eth0
any host 192.168.1.11 gw 10.1.1.1
# Member routes
# Node Routes
[root@grid ~]#
As well as (on the FE)
[root@grid ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
grid.uhd.edu grid.local 255.255.255.255 UGH 0 0 0 eth0
192.168.1.0 * 255.255.255.0 U 0 0 0 eth1
10.1.0.0 * 255.255.0.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth1
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default 192.168.1.254 0.0.0.0 UG 0 0 0 eth1
[root@grid ~]#
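For reference, the routes Rocks will push to a node can also be listed through the rocks command line (a sketch from memory; see the Rocks 5 command reference for the exact syntax):

  rocks list host route compute-0-0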
2009/3/23 p K <p_k...@hotmail.com>
[root@froan log]# dbreport static-routes compute-0-0
#
# Do NOT Edit (generated by dbreport)
#
# Global routes
any net 224.0.0.0/4 dev eth0
any host 255.255.255.255 dev eth0
any host 129.241.249.140 gw 10.1.1.1
# Member routes
# Node Routes
[root@froan log]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
froan.fish.sint froan.local 255.255.255.255 UGH 0 0 0 eth0
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
129.241.249.128 * 255.255.255.128 U 0 0 0 eth1
10.254.2.0 * 255.255.255.0 U 0 0 0 eth3
10.254.1.0 * 255.255.255.0 U 0 0 0 eth2
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth3
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default cgw1-sab-051.si 0.0.0.0 UG 0 0 0 eth1
Thx
> Date: Mon, 23 Mar 2009 10:15:09 -0500
> From: drache...@gmail.com
Did you follow something from the guide? What do eth2 and eth3 connect to?
This line from yours:
169.254.0.0 * 255.255.0.0 U 0 0 0 eth3
compared to mine:
169.254.0.0 * 255.255.0.0 U 0 0 0 eth1
makes me wonder if that's part of the problem. However, since the default
gw is on eth1, this should be fine.
Ok, next troubleshooting step...
ssh into compute-0-0 (assuming it's one of the ones that can't "dial out") and run this:
[root@compute-0-0 ~]# traceroute google.com
traceroute to google.com (74.125.45.100), 30 hops max, 40 byte packets
1 grid.local (10.1.1.1) 0.238 ms 0.216 ms 0.193 ms
2 192.168.1.254 (192.168.1.254) 0.695 ms 1.340 ms 1.322 ms
3 192.0.2.100 (192.0.2.100) 15.234 ms 18.409 ms 21.065 ms
4 dist1-vlan50.hstntx.sbcglobal.net (151.164.11.126) 24.733 ms 30.832 ms 33.485 ms
5 bb1-g14-0.hstntx.sbcglobal.net (151.164.92.204) 36.776 ms 39.702 ms 43.123 ms
6 12.83.63.157 (12.83.63.157) 95.138 ms 98.749 ms 101.442 ms
7 asn15169-google.eqnwnj.sbcglobal.net (151.164.248.202) 143.078 ms 143.762 ms 144.276 ms
8 209.85.255.68 (209.85.255.68) 115.389 ms 104.935 ms 105.514 ms
9 216.239.46.48 (216.239.46.48) 114.676 ms 76.865 ms 63.938 ms
10 * *
[root@compute-0-0 ~]#
(note, I hit ctrl-c to break after it got to 9 steps) and then run that
again from the FE:
[root@grid ~]# traceroute google.com
traceroute to google.com (209.85.171.100), 30 hops max, 40 byte packets
1 192.168.1.254 (192.168.1.254) 0.481 ms 1.085 ms 1.085 ms
2 192.0.2.100 (192.0.2.100) 14.595 ms 17.544 ms 21.218 ms
3 dist2-vlan60.hstntx.sbcglobal.net (151.164.11.189) 24.166 ms 27.861 ms 30.781 ms
4 bb2-g2-0.hstntx.sbcglobal.net (151.164.43.42) 34.469 ms 38.191 ms 41.109 ms
5 12.83.63.149 (12.83.63.149) 92.908 ms 96.616 ms 99.516 ms
6 * * *
7 * 209.85.255.68 (209.85.255.68) 88.422 ms 88.860 ms
8 216.239.46.227 (216.239.46.227) 92.386 ms 209.85.251.233 (209.85.251.233) 68.631 ms 74.938 ms
[root@grid ~]#
(again, I hit ctrl-c during the run)
Let's see what those numbers return...
2009/3/23 p K <p_k...@hotmail.com>
Hi, thanks for the help and sorry for the delay. I got this from the compute node:
[root@compute-0-0 ~]# traceroute google.com
google.com: Temporary failure in name resolution
Cannot handle "host" cmdline arg `google.com' on position 1 (argc 1)
[root@compute-0-0 ~]#
and this from the FE:
1 cgw1a-sab-051.sintef.no (129.241.249.135) 0.492 ms 0.575 ms 0.681 ms
2 sintef-fw1.sintef.no (129.241.249.65) 0.606 ms 0.653 ms 0.653 ms
3 129.241.249.73 (129.241.249.73) 1.209 ms 1.476 ms 1.700 ms
4 trd-gw.uninett.no (129.241.249.14) 1.080 ms 1.162 ms 1.201 ms
5 oslo-gw.uninett.no (128.39.65.73) 15.722 ms 15.818 ms 15.818 ms
6 se-tug.nordu.net (193.10.68.105) 16.607 ms 16.337 ms 16.299 ms
7 se-tug2.nordu.net (193.10.252.94) 16.279 ms 16.274 ms 16.273 ms
8 google-gw.nordu.net (193.10.68.42) 16.739 ms 16.731 ms 16.711 ms
9 209.85.252.186 (209.85.252.186) 16.493 ms 42.155 ms 16.625 ms
10 209.85.254.153 (209.85.254.153) 35.096 ms 35.177 ms 35.177 ms
11 216.239.48.10 (216.239.48.10) 53.581 ms 52.788 ms 54.929 ms
12 209.85.248.182 (209.85.248.182) 52.373 ms 52.359 ms 51.998 ms
13 72.14.233.62 (72.14.233.62) 62.244 ms 63.187 ms 62.659 ms
14 209.85.248.81 (209.85.248.81) 64.454 ms 64.260 ms 209.85.250.54 (209.85.250.54) 133.822 ms
15 209.85.251.233 (209.85.251.233) 151.664 ms 216.239.43.192 (216.239.43.192) 131.600 ms 131.952 ms
16 216.239.43.113 (216.239.43.113) 132.218 ms 132.794 ms 132.756 ms
17 209.85.251.233 (209.85.251.233) 152.001 ms 216.239.48.143 (216.239.48.143) 219.879 ms 216.239.48.141 (216.239.48.141) 219.865 ms
18 209.85.251.153 (209.85.251.153) 209.010 ms 209.85.251.133 (209.85.251.133) 216.511 ms 216.239.46.204 (216.239.46.204) 248.098 ms
19 216.239.48.143 (216.239.48.143) 210.939 ms 216.239.46.212 (216.239.46.212) 205.067 ms 203.956 ms
20 64.233.174.99 (64.233.174.99) 207.207 ms 64.233.174.97 (64.233.174.97) 206.755 ms 209.85.251.141 (209.85.251.141) 207.280 ms
21 74.125.30.2 (74.125.30.2) 213.103 ms cg-in-f100.google.com (209.85.171.100) 211.694 ms 210.687 ms
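One quick way to separate the DNS failure from a routing failure (a sketch; 74.125.45.100 is just the google.com address from the earlier trace, and 10.1.1.1 is assumed to be the frontend's internal interface, which serves DNS to the nodes in a stock Rocks install):

  # on compute-0-0: trace by raw IP, bypassing name resolution
  traceroute -n 74.125.45.100
  # and check that the node's resolver points at the frontend
  cat /etc/resolv.conf    # expect: nameserver 10.1.1.1

If the traceroute by IP also dies after the first hop, the problem is routing/NAT on the FE rather than DNS.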
> Date: Mon, 23 Mar 2009 11:48:04 -0500
So there was one other question: how did you configure eth2 and eth3? Were those already present during the initial setup of Rocks, or were they added later?
2009/3/26 p K <p_k...@hotmail.com>
> 21 74.125.30.2 (74.125.30.2) 213.103 ms cg-in-f100.google.com (209.85.171.100) 211.694 ms 210.687 ms
eth2 and eth3 were installed later, I guess following the user guide (we didn't do it ourselves).
Is it this the problem ?. But usuallly do the compute nodes are outside the network than the FE or is it just our installation ?
Regards
> Date: Thu, 26 Mar 2009 14:22:56 -0500
I don't understand the questions in the last email. And if you didn't
install the cards personally, can you find out who did, and what steps they
took to install the cards? The following line is the part I don't
understand. Perhaps you can word the question differently? I presume the
first question to be "Is this (eth2 and eth3 configuration) the problem?" to
which my answer is "Perhaps so".
*Is it this the problem ?. But usuallly do the compute nodes are outside the
network than the FE or is it just our installation ?*
Did everything work correctly before eth2 and eth3 were added, or has the
system "never worked correctly"? Also, reading back through my
troubleshooting, I believe I overlooked something very simple. Did I ever
ask for (because I can't seem to find it) this command from compute-0-0?
[root@compute-0-0 ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
grid.uhd.edu grid.local 255.255.255.255 UGH 0 0 0 eth0
10.1.0.0 * 255.255.0.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth0
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default grid.local 0.0.0.0 UG 0 0 0 eth0
[root@compute-0-0 ~]#
Thanks and again I apologize for my delays in responding...
Cole Brand
http://grid.uhd.edu
2009/3/27 p K <p_k...@hotmail.com>
Hi, OK, the firewall is up again and here is the information:
>
> [root@froan log]# dbreport static-routes compute-0-0
> #
> # Do NOT Edit (generated by dbreport)
> #
> # Global routes
> any net 224.0.0.0/4 dev eth0
> any host 255.255.255.255 dev eth0
> *any host 129.241.249.140 gw 10.1.1.1*
> # Member routes
> # Node Routes
> [root@froan log]# route
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> froan.fish.sint froan.local 255.255.255.255 UGH 0 0 0 eth0
> 255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
> *129.241.249.128 * 255.255.255.128 U 0 0 0 eth1*
> 10.254.2.0 * 255.255.255.0 U 0 0 0 eth3
> 10.254.1.0 * 255.255.255.0 U 0 0 0 eth2
> 10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
> 169.254.0.0 * 255.255.0.0 U 0 0 0 eth3
> 224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
> default cgw1-sab-051.si 0.0.0.0 UG 0 0 0 eth1
>
> Thx
>
In the middle I bolded two items of import (if the formatting doesn't follow my post, let me know and I'll isolate the two lines in question). For comparison, here's my output for those two commands:
> [root@grid ~]# dbreport static-routes compute-0-0
> #
> # Do NOT Edit (generated by dbreport)
> #
> # Global routes
> any net 224.0.0.0/4 dev eth0
> any host 255.255.255.255 dev eth0
> *any host 192.168.1.11 gw 10.1.1.1*
> # Member routes
> # Node Routes
> [root@grid ~]# route
> Kernel IP routing table
> Destination Gateway Genmask Flags Metric Ref Use Iface
> 255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
> grid.uhd.edu grid.local 255.255.255.255 UGH 0 0 0 eth0
> *192.168.1.0 * 255.255.255.0 U 0 0 0 eth1*
> 10.1.0.0 * 255.255.0.0 U 0 0 0 eth0
> 169.254.0.0 * 255.255.0.0 U 0 0 0 eth1
> 224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
> default 192.168.1.254 0.0.0.0 UG 0 0 0 eth1
> [root@grid ~]#
>
And I've bolded the similar two lines on my configuration. Now, I don't
mean to imply that this IS the problem, but it *may* be the problem. I'll
look for the ssh compute-0-0 route command output to find out for sure if my
hunch is correct. I'm posting this at the moment for feedback from the rest
of the list, to see if there is a similar concern that I may have overlooked
a second trivial thing...
What are the IP addresses of the two hosts? (I've provided the bolded lines from above for reference, to show what they appear to be.)
froan.fish.sint *129.241.249.140*
cgw1-sab-051.sintef.no *129.241.249.128*
but pulling a tidbit from another post, I should think that the following line is the internal network address of "froan.fish.sint", except now we've got yet another hostname, similar to the previous:
cgw1a-sab-051.sintef.no 129.241.249.135
Ok, let's not muddy the waters overly much, let's get some feedback on the
already posted questions and continue from there. To recap:
What is the output of `route` from compute-0-0 (via ssh presumably)?
Have the nodes ever been able to get to the internet?
Do we know "how" the eth2 and eth3 were added (but at this point it looks
like it doesn't matter)?
What are the internal IP addresses of froan.fish.sint and
cgw1-sab-051.sintef.no?
How does froan.fish.sint connect to the internet? (What is its gateway
machine name and IP?) (or: Can you make a small diagram of the network
where this is installed, showing from the internet to the FE? - I would be
happy to demonstrate an example if you like)
2009/3/30 Cole Brand <drache...@gmail.com>
I didn't see any bold lines in your mail, but I guess these were the two lines:
> 10.254.1.0 * 255.255.255.0 U 0 0 0 eth2
> 10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
So to answer the questions:
-- The name of the cluster is froan.fish.sintef.no
--What is the output of `route` from compute-0-0 (via ssh presumably)?
[root@compute-0-0 ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
froan.fish.sint 10.1.1.1 255.255.255.255 UGH 0 0 0 eth0
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth0
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default 10.1.1.1 0.0.0.0 UG 0 0 0 eth0
- Have the nodes ever been able to get to the internet?
NO
- Do we know "how" the eth2 and eth3 were added (but at this point it looks like it doesn't matter)?
I am still waiting for an answer from those who did the job.
- What are the internal IP addresses of froan.fish.sint and cgw1-sab-051.sintef.no?
The internal IP address of the FE (froan.fish.sintef.no) is 10.1.1.1 (from ifconfig).
I don't know about "cgw1-sab-051.sintef.no".
Each of the four nodes has an internal IP: 10.1.1.254 / 10.1.1.253 / 10.1.1.252 / 10.1.1.251.
- How does froan.fish.sint connect to the internet? (What is it's gateway machine name and IP?) (or: Can you make a small diagram of the network where this is installed, showing from the internet to the FE? - I would be happy to demonstrate an example if you like)
IP address of froan.fish.sintef.no: 129.241.249.140
Gateway: 129.241.249.129
Do you need more information?
Again Thanks
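One more test that would narrow this down: a step-by-step connectivity ladder from a compute node (a sketch; the IPs are the ones given above):

  ping -c 3 10.1.1.1          # the FE's internal side
  ping -c 3 129.241.249.140   # the FE's external side
  ping -c 3 129.241.249.129   # the external gateway (tests forwarding/NAT through the FE)

Whichever step fails first shows where the path breaks.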
> Date: Mon, 30 Mar 2009 22:41:23 -0500
I have been waiting for the release of CentOS 5.3 (which is now out:
http://ftp.gts.lug.ro/centos/5.3/isos/) because CentOS 5.2, which is used in
Rocks 5.1, did not support my Realtek NICs.
Anyway, my question is: does anyone have insight into using the new version
of CentOS with Rocks (I understand this is virgin territory, as 5.3 was just
released)? Should I be able to replace the two OS disks that are part of the
Rocks install with the first two CD ISOs from the 5.3 release?
Any thoughts or feedback is greatly appreciated.
Cheers
Steve
Steven J. Berg
------------------------
Ph.D. Student
Physics 220C
Department of Earth and Environmental Sciences
University of Waterloo
I downloaded CentOS 5.3 and tried using the first ISOs instead of the ones
included as part of the Rocks 5.1 roll. I guess it's not really a surprise,
but I ran into problems.
The install appeared to proceed fine, fancy new CentOS logo and all.
Anyway, after the install was complete I was still unable to connect to the
internet (apparently it still doesn't like my Ethernet card), and when I
tried running insert-ethers from a terminal I received the following error:
"error - iteration over non-sequence"
I assume this is a 5.3 issue.
Another quick question.
My understanding is that the required drivers for my NICs are included in
the new kernel released as part of 5.3. However, there is a kernel roll as
part of the Rocks installation (which likely contains a bunch of goodies for
Rocks). Does the Rocks kernel override the OS kernel, and could that be why
I am still having problems?
Steve
You will need to supply all of the CentOS disks, not just the first two.
It should work just fine.
Tim
no, there is no kernel RPM inside the kernel roll. the kernel RPMs
come from the base OS.
as others have remarked, your issue probably has to do with not
supplying *all* the CentOS CDs during the initial install. you will
need to reinstall the frontend, but this time supply all the CentOS
CDs (i believe there are 6-7), or you can supply the CentOS DVD.
- gb
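(If you want to confirm which kernel you actually ended up with after the install, a quick check, assuming stock rpm tooling:

  rpm -qi kernel

The Vendor and Build Host fields should point at CentOS rather than a Rocks rebuild.)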
How does the cluster find out about the new application, and how do
nodes make use of it?
Next, in env, I see the following, which doesn't seem right:
"//opt/Bio/glimmer/scripts:" (Why are there two // in front of
/opt/Bio? It is the only env line that has // in front of it.)
Thx
Which variable gives you the double "//" - PATH or some other?
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
You must put the path to your application in the execution PATH.
This is what I do. There are other ways.
1) Create a file called /share/apps/sys.bashrc. This is what mine looks like.
[pmk@superm chr11]$ cat /share/apps/sys.bashrc
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
alias lsl='ls -la'
alias up='cd ..'
alias h='history'
alias md='mkdir'
alias rd='rmdir'
alias p='ps aux | grep $1'
alias k='kill -9 $1'
alias j=jobs
alias lpdoug='lp -ddoug'
alias lpma='/usr/bin/enscript -FCourier10 -fCourier7 -2rG -d doug'
alias lpmb='/usr/bin/enscript -c -r -G -FCourier10 -fCourier7 -2rG -Pdoug'
alias lpm='/usr/bin/enscript -h -fCourier7 -2r -B -Pdoug'
export PATH=${PATH}:/share/apps/R-2.4.1/bin:/share/apps/cpm:/share/apps/pgi/linux86/6.1/bin
export MANPATH=${MANPATH}:/share/apps/pgi/linux86/6.1/man
export PGI=/share/apps/pgi
export LM_LICENSE_FILE=${PGI}/license.dat
export PGRSH=ssh
export PERL5LIB=/share/apps/R-2.4.1/library/RSPerl/perl
export LD_LIBRARY_PATH=/share/apps/R-2.4.1/lib:/share/apps/R-2.4.1/library/RSPerl/libs:/share/apps/gsl/lib:$LD_LIBRARY_PATH
[pmk@superm chr11]$
2) For any application you put on /share/apps, append the path to its
executables to the "export PATH" command listed above.
3) Add 'source /share/apps/sys.bashrc' to each user's .bashrc file.
The next time they log in they will be able to call your application
from any directory, on any node.
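So for the application from the original question, installed in /share/apps/applicationname (a hypothetical path taken from that post), the append would look something like:

  export PATH=${PATH}:/share/apps/applicationname/bin   # adjust to wherever its executables actually live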
Good Luck,
Paul
At 03:42 PM 3/31/2009, you wrote:
>On a new Rocks 5.1 installation with several nodes, we've installed
>our application into /share/apps/applicationname as the documentation
>says, yet when I run env, I don't find a path to /share/apps/.
>
>How does the cluster find out about the new application, and how do
>nodes make use of it?
>
>Next, in env, I see the following, which doesn't seem right:
>"//opt/Bio/glimmer/scripts:" (Why are there two // in front of
>/opt/Bio? It is the only env line that has // in front of it.)
>
>Thx
______________________________________________________
Paul Kopec
Project Manager
University of Michigan
Dept. of Human Genetics
1241 E. Catherine Street
5928 Buhl Building
Ann Arbor, MI 48109-0618
734-763-5411
pko...@umich.edu
PATH=/usr/local/bin:/usr/local/sbin://opt/Bio/glimmer/scripts
and PATH is the variable, but if it is something else, the problem
may lie elsewhere. Usually a double slash is ignored, and you
shouldn't have a problem. But, it indicates that whatever script/code
added it to the path is not 100% correct.
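If you want to track down which script is adding it, grepping the usual login-script locations should turn it up (a sketch; I believe the Bio roll drops its environment scripts in /etc/profile.d, but that's from memory):

  grep -rn '/opt/Bio' /etc/profile.d/ /etc/profile /etc/bashrc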
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
> discussio...@sdsc.edu] On Behalf Of Dave Felt
> Sent: Tuesday, March 31, 2009 1:19 PM
> To: Discussion of Rocks Clusters
Some days I need more coffee, I think... I'm going to repost something from
before, and I would still like the route output from compute-0-0, but let's
investigate something else as well. You posted this some time ago:
>
> Hi, OK, the firewall is up again and here is the information:
>>
>> [root@froan log]# dbreport static-routes compute-0-0
>> #
>> # Do NOT Edit (generated by dbreport)
>> #
>> # Global routes
>> any net 224.0.0.0/4 dev eth0
>> any host 255.255.255.255 dev eth0
>> *[[any host 129.241.249.140 gw 10.1.1.1]]*
>> # Member routes
>> # Node Routes
>> [root@froan log]# route
>> Kernel IP routing table
>> Destination Gateway Genmask Flags Metric Ref Use Iface
>> froan.fish.sint froan.local 255.255.255.255 UGH 0 0 0 eth0
>> 255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
>> *[[129.241.249.128 * 255.255.255.128 U 0 0 0 eth1]]*
>> 10.254.2.0 * 255.255.255.0 U 0 0 0 eth3
>> 10.254.1.0 * 255.255.255.0 U 0 0 0 eth2
>> 10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
>> 169.254.0.0 * 255.255.0.0 U 0 0 0 eth3
>> 224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
>> default cgw1-sab-051.si 0.0.0.0 UG 0 0 0 eth1
>>
>> Thx
>>
>
> In the middle I bolded two items of import (if the formatting doesn't
> follow my post, let me know and I'll isolate the two lines in question).
> For comparison, here's my output for those two commands:
>
>> [root@grid ~]# dbreport static-routes compute-0-0
>> #
>> # Do NOT Edit (generated by dbreport)
>> #
>> # Global routes
>> any net 224.0.0.0/4 dev eth0
>> any host 255.255.255.255 dev eth0
>> *[[any host 192.168.1.11 gw 10.1.1.1]]*
---------------------------------------------------------------------------------------------------
Ok, let's get back to the other part. I requoted mine so that I can requote
yours and keep going. I hope the quoting shows up for those following.
---------------------------------------------------------------------------------------------------
I'll post followup to this below
---------------------------------------------------------------------------------------------------
And to break with my traditional reply technique, I'm deleting everything
else from the reply string. While this breaks searches on google where
people can follow the whole reply sequence (which is why I normally do it
the other way) this shortens the message and keeps us focused, so it's a
tradeoff.
** PART A **
The first set of bolded lines was this pair, marked with both styles of
"highlighting" (brackets and bold)
*[[any host 129.241.249.140 gw 10.1.1.1]]*
*[[129.241.249.128 * 255.255.255.128 U 0 0 0 eth1]]*
And what I take from this is that the dbreport output ([root@froan log]#
dbreport static-routes compute-0-0) and what the FE's routing table reports
don't quite match up, because as mine shows:
** PART B **
[root@grid ~]# dbreport static-routes compute-0-0
*[[any host 192.168.1.11 gw 10.1.1.1]]*
*[[192.168.1.0 * 255.255.255.0 U 0 0 0 eth1]]*
they are just a little bit different. Do you see how yours on the second
line has 129.241.249.128 and mine on the second line has 192.168.1.0? That
.0 means that it's for any .xxx on that interface, not just the one IP.
It's not necessarily "wrong", but it's something to note as we continue.
** PART C **
Then I take and I look at how the network beyond the FE sees the FE, and
that's this part:
*[[froan.fish.sintef.no 129.241.249.140]]*
*[[cgw1-sab-051.sintef.no 129.241.249.128]]*
and so I presume that froan.fish.sintef.no is getting an IP from its
internal network, but it looks like the FE is getting a DHCP IP from the
network, as opposed to a statically allocated IP. Has the FE been rebooted
recently? And the answer to this question would appear to be yes, since you
mentioned that the eth2 and eth3 were added in later. Since the FE has been
rebooted, and was likely off the network for more than a few minutes, it
possibly acquired a new IP, which changes how the compute-nodes are supposed
to talk to the internet. That's part one of the inconsistency noted in
(PART A) above (compare to PART B). This would explain why the computes are
looking for 129.241.249.140 and the FE thinks it's on 129.241.249.128. That
brings us to the part towards the bottom (PART C). However, the
configuration would appear to be that the compute nodes are actually looking
to route through cgw1-sab-051.sintef.no, but since they're not "close" to
that computer, that they can't route that way.
However, having said that, this doesn't make any sense. *Sorry, but I'm
trying to think out loud here, and I want the thought process to be followed
*. 129.241.249.128 is the gateway computer (cgw) and it appears that it is
the first gateway on the network, hence the 1 in cgw1. The rest of the name
for cgw1-sab-051.sintef.no would likely make more sense if I spoke the local
language, but being English, I don't understand what "sab" stands for
(is that a building name by chance?). But that is not important for the
configuration, only that we know what the numbers are. Now that I've said
that cgw1 looks to be your gateway (from the routing information provided so
far) then I can only say that I don't understand now why the gateway
provided in your response is 129.241.249.129. That would be inconsistent
with the routing information given. So let's backup and start again. I'm
liable to backtrack after this, I've been working on it for a couple of
days, and thinking about what might have gone wrong, so there are some
starts and stops to the typing.
Can you confirm that the FE has a static IP, and not a DHCP IP? (I'll post
all the questions again at the bottom for concise answers)
This next part will only look "correct" if it is seen in "Courier New" so it
may be better if you copy everything in these lines and put it into notepad,
and match the font. I'm going to draw a diagram in text. I'm sorry if it
appears too cluttered, but I want things to be correct, and a picture (even
in text) is worth a lot! All the information presented in the "your network
graph" is taken from the emails that we've had so far. I want to make sure
that they are correct, so feel free to tell me if they are wrong.
Your network as I understand it right now - PLEASE TELL ME IF THIS DOESN'T
MATCH
--------------------------------copy----------------------------------
[ compute node] [ frontend/head node ]
[ compute-0-0 ] [froan.fish.sintef.no]
[ 10.1.1.254 ]---[switch]---[10.1.1.1 ]
[ compute-0-1 ] | | | [ 129.241.249.140]-----[switch]
[ 10.1.1.253]----+ | | [ 10.254.1.1 ] | |
[ compute-0-2 ] | | [ 10.254.2.1 ] | |
[ 10.1.1.252]------+ | | |
[ compute-0-3 ] | | |
[ 10.1.1.251]--------+ [cgw1-sab-051.sintef.no] | | <---???
[ 129.241.249.128?]-+ | <---???
{internet}----[129.241.249.129? ] | <---???
|
[cgw1a-sab-051.sintef.no ] | <---???
[ 129.241.249.135?]-+ <---???
{internet}----[129.241.249.129? ] <---???
-------------------------------- fin ---------------------------------
And another drawing. My network, more or less (I've got more nodes and
switches behind my FE, but they have no immediate bearing on why your four
can't get to the internet).
--------------------------------copy----------------------------------
[compute nodes] [ frontend/head node ]
[ compute-0-0 ] [ grid.uhd.edu ]
[ 10.1.1.254 ]---[switch]---[10.1.1.1 ]
[ compute-0-1 ] | | | [ 192.168.1.11]-----[switch]
[ 10.1.1.253]----+ | | |
[ compute-0-2 ] | | |
[ 10.1.1.252]------+ | |
[ and more ] | |
[ not listed ]--------+ [ "dsl" router ] |
[ 192.168.1.254]-+
{internet}----[71.143.128.57 ]
-------------------------------- fin ---------------------------------
Ok, so hopefully that makes sense. I'm going to
work from those two drawings on this next bit, but I think I'm back to
rambling on, so if I'm repeating myself, I offer my apologies. I have no
proofreaders. :-(
On your network, the routing table for compute-0-x in my understanding
should look like this (again, this looks better in courier new):
[user@compute-0-0 ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
froan.fish.sint froan.local 255.255.255.255 UGH 0 0 0 eth0
10.1.0.0 * 255.255.0.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth0
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default froan.local 0.0.0.0 UG 0 0 0 eth0
[user@compute-0-0 ~]#
but what we have is:
[root@compute-0-0 ~]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
froan.fish.sint 10.1.1.1 255.255.255.255 UGH 0 0 0 eth0
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth0
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default 10.1.1.1 0.0.0.0 UG 0 0 0 eth0
The part where it says 10.1.1.1 instead of froan.local is not a bad thing,
since you have more than one adapter on 10.x.x.x in froan, but I am curious
why the line 10.1.1.0 is not 10.1.0.0 with a netmask of 255.255.0.0; again,
this is not likely to stop anything, that should be ok. I know that there
are several subtle differences between the two, such as a netmask change, an
IP change, some different namings, etc.
Just so long as the following output looks very close (identical would be
preferred, but I added spacing for alignment) to this:
[user@compute-0-0 ~]# cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
129.241.249.140 froan.fish.sintef.no
10.1.255.254 compute-0-0.local compute-0-0
Continuing on with our analysis of what it should look like (in my mind -
I've never claimed to always be right), I would expect to see this on the FE
(froan.fish.sintef.no):
[root@froan log]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
froan.fish.sint froan.local 255.255.255.255 UGH 0 0 0 eth0
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
129.241.249.0 * 255.255.255.0 U 0 0 0 eth1
10.254.2.0 * 255.255.255.0 U 0 0 0 eth3
10.254.1.0 * 255.255.255.0 U 0 0 0 eth2
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth3
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default 129.241.249.128 0.0.0.0 UG 0 0 0 eth1
instead of this
[root@froan log]# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
255.255.255.255 * 255.255.255.255 UH 0 0 0 eth0
froan.fish.sint froan.local 255.255.255.255 UGH 0 0 0 eth0
129.241.249.128 * 255.255.255.128 U 0 0 0 eth1
10.254.2.0 * 255.255.255.0 U 0 0 0 eth3
10.254.1.0 * 255.255.255.0 U 0 0 0 eth2
10.1.1.0 * 255.255.255.0 U 0 0 0 eth0
169.254.0.0 * 255.255.0.0 U 0 0 0 eth3
224.0.0.0 * 240.0.0.0 U 0 0 0 eth0
default cgw1-sab-051.si 0.0.0.0 UG 0 0 0 eth1
Again, there are some subtle differences, such as different ordering of
lines, and some netmask changes, and different identifiers (instead of the
name, the IP for cgw1...)
But I'm still confused on this. Let's go back to the drawing above about
your network. Do you see the two nodes marked with "<--- ???" Can you
clarify that for me at all? If you don't know the answer, or don't know the
names of the computers, that's fine, but it makes some difference. But we
can work around it, so long as we know which "gateway" leads us to the
internet.
Ok, all that was to get to this. It looks like the IP addresses on the
network are being given to the FE by DHCP, not statically assigned. If it
is static, then something is misconfigured somewhere. I can
tell you what to do to change the IPs, and how to correct it, and I can do
that in several different ways, but let's get to a meeting ground first
(figuratively speaking). Let's play one more round of questions, and then
let's do a solution.
All questions for this round:
1) Can you state for sure that the network IP address for
froan.fish.sintef.no is a static IP for the network? (See #2 below, and the
check sketched after this list.)
2) Do you have the documentation from the network administration that says
what the IP address for froan.fish.sintef.no is, and the netmask that it
should use, and the IP of the gateway that froan.fish.sintef.no should use?
(I had to get exactly this information in writing from my network
administration, so that they and I would be in agreement before I made a
public interface for my cluster.)
2a) Would you mind copying and pasting or typing in the information for
those three values?
3) If you don't have the documentation, can they provide that to you, in
writing? (I'm sorry to make this one so exact if it does not need to be
that way for you, but for most large groups, this is a standard practice. I
believe it would be standard for you as well, so I ask.)
4) Would you be upset if I asked you to reinstall the compute nodes? By
this I mean, have you made any customizations to the compute nodes that are
not part of the rocks maintenance package? This would include data on local
scratch partitions that are not on /state/partition1, any particular library
optimizations that aren't in a roll, and so on. I don't anticipate asking
you to do this, but it wouldn't take very long, and it would be a little
less "work" intensive, especially if something is still not right. Plus,
it's a safeguard to reinstall the nodes when things don't work right.
4a) I don't think I've asked this yet: have you tried reinstalling any of
the four nodes lately? As I've typed all this up, it occurs to me that
rocks remove host <hostname> and then reinstalling might just clear the
whole mess up, without another word (see the sketch after this list).
5) Are any other nodes connected to eth2 or eth3 as separate networks? (I
doubt it, but I'm trying to cover all the questions before they are
problems.)
6) Are there any other devices in the network that need to be "fixed"
besides the compute nodes and the FE?
7) VERY IMPORTANT: In my diagram of your network, did anything there "not
match"?
8) In my diagram of your network, is any part WRONG? Even a little bit of
wrong is important.
9) To anyone else following this thread (and if you're reading and haven't
tried to help, seeing as I've been away, I say -> :-P ), do you see anything
I've missed on the reconstruction here? It looks like some values have
gotten criss-crossed, and I'm going to "straighten" them out, but I want to
do it right in the first pass. (I'm being brave aren't I?)
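For questions 1 and 4a above, the checks are quick (a sketch, assuming stock CentOS/Rocks file locations; adjust the interface and host names to yours):

  # 1) static vs DHCP on the public interface
  grep BOOTPROTO /etc/sysconfig/network-scripts/ifcfg-eth1   # want BOOTPROTO=static

  # 4a) force a node to reinstall on its next boot
  rocks set host boot compute-0-0 action=install
  ssh compute-0-0 reboot
  # (or run /boot/kickstart/cluster-kickstart on the node itself)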
Now, here's the final word on this. Obviously I would like "some" feedback,
but this may be the final answer that you need. Changing the IP
configuration, depending on what the answers to 2 and 2a are and on how the
network drawing compares to reality, could be very simple or rather drawn
out. I personally did change my public IP by hand (partly
because I'm brave, and partly because I could stand the reinstall from
scratch, and partly because I know something about the systems involved) but
I don't just recommend it as a first option. I would prefer to identify
what is wrong, and formulate a course of action that is a little more
programmatic in nature. What this means for you is that it may be simpler
and faster for you to make a backup of the configuration and rebuild the FE
from the DVD. There is a good walkthrough of this in the documentation.
Ok, sorry again for the delays in the response, but I'm sure that you've not
been idle with the time.
Cheers,
Cole Brand
http://grid.uhd.edu