I currently have a "nice" 24-node 8-core/node cluster, all 64 bit,
running ROCKS.
I've also inherited an old, 16-node, 2-proc, 32-bit cluster with no
cluster management system, queuing system, or anything, and running
RedHat 9. The users are working on migrating their code to the new
cluster. Once that's complete, I get to reformat the old cluster and
try to make it useful again. This has gotten me thinking about how
best to set this up.
My "ideal world" has already been shot full of many holes (just add
the nodes to the existing cluster and let them all be shared
transparently by users) for many good reasons.
My next-best-plan would be to have the same headnode "run" both
clusters, and just have two different queues that users submit to for
the two clusters (i.e., the nodes only belong to one queue or the
other). This has several advantages from my vantage point:
1) I only pay for one PGI license (as opposed to the two I currently pay for)
2) I get 16 compute nodes out of the old cluster (instead of the 15 +
1 headnode I'd get -- the new cluster has a dedicated head node)
3) I get a single login system and home directories (without having to
do extra work to try and combine them)
4) Users have the same interface/tools, they need only change one
option to submit to the right queue (rather than remembering there's
another cluster and logging into it)
5) It makes it easy to share the existing 30TB PVFS volume with the
old cluster (currently there's no access there)
6) Users can see the status of both clusters through a single ganglia
view (although some extra work will probably be required)
Potential disadvantages:
1) The whole compiling for 32-bit vs 64-bit problem
2) How does Rocks deal with this situation, anyway?
3) Having to set up a cross-kickstart environment on the head node
Does anyone have any other things to take into account? How does one
compile code for a 32-bit target on a 64-bit headnode? Is this "a good
idea"? (A good part of me has visions of a basically-always-idle
cluster if the two are set up separately, so I'm very interested in
trying to combine them if it could work out decently...)
--Jim
So how old are the "old" nodes? Can they run a 64-bit OS? That would solve
the 32/64 problem nicely. If your users are migrating code to the new
cluster, I assume they are recompiling everything for 64-bit?
Tim
mr9540.local: Connection refused
p0_11729: p4_error: Child process exited while making connection to remote process on mr9540: 0
p0_11729: (33.007812) net_send: could not write to fd=4, errno = 32
using the command:
export PATH=/share/apps/orca_amd64_exe/:/opt/mpich/gnu/bin/:$PATH && /share/apps/orca_amd64_exe/orca testORCA.inp
The first part of the command adds ORCA and the correct MPI type to the front of the PATH, while the second part (after the &&) runs the job. From the error I got, it seems there is a communication problem when using MPI; unfortunately, I really don't know how to fix this...
>
> My "ideal world" has already been shot full of many holes (just add
> the nodes to the existing cluster and let them all be shared
> transparently by users) for many good reasons.
>
> My next-best-plan would be to have the same headnode "run" both
> clusters, and just have two different queues that users submit to for
> the two clusters (i.e., the nodes only belong to one queue or the
> other). This has several advantages from my vantage point:
How are these two situations different? To do the latter, you would
basically integrate them into the cluster as 32 bit nodes, and then
define two queues, one for 64 bit and one for 32 bit. I do this now.
Then, instead of the default queue being used, you make sure the users
specify which queue they want/need. Not quite transparent, but it
is pretty close.
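For what it's worth, one hedged way to realize that two-queue split under Torque (which this thread also uses; SGE can do the same with hostgroups): tag each node with an architecture property in the nodes file and key a queue to it. Queue names and property names below are illustrative, not from the thread.

```shell
# /opt/torque/server_priv/nodes -- tag nodes with an arch property:
#   compute-0-0   np=8   arch64
#   compute-1-0   np=2   arch32
# Then bind a queue to each property via qmgr (names illustrative):
qmgr -c "create queue q64 queue_type=execution"
qmgr -c "set queue q64 resources_default.neednodes = arch64"
qmgr -c "set queue q64 enabled = true"
qmgr -c "set queue q64 started = true"
# Repeat with arch32 for the 32-bit queue; users then pick a queue with
#   qsub -q q64 job.sh
```

Restart pbs_server after editing the nodes file, as discussed later in the thread.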
> 1) I only pay for one PGI license (as opposed to the two I currently
> pay for)
> 2) I get 16 compute nodes out of the old cluster (instead of the 15 +
> 1 headnode I'd get -- the new cluster has a dedicated head node)
> 3) I get a single login system and home directories (without having to
> do extra work to try and combine them)
> 4) Users have the same interface/tools, they need only change one
> option to submit to the right queue (rather than remembering there's
> another cluster and logging into it)
> 5) It makes it easy to share the existing 30TB PVFS volume with the
> old cluster (currently there's no access there)
> 6) Users can see the status of both clusters through a single ganglia
> view (although some extra work will probably be required)
All sound like great advantages.
>
> Potential disadvantages:
>
> 1) The whole compiling for 32-bit vs 64-bit problem
> 2) How does Rocks deal with this situation, anyway?
> 3) Having to set up a cross-kickstart environment on the head node
With some compilers it is as simple as supplying a flag when building.
Or, just have them build code on one of the 32-bit nodes. You could
even have SGE submit the compile. Cross-kickstarting is not that
bad.
I believe that gcc compiles 32 bit code with "-m32" or "-march=i386".
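A quick way to check whether the -m32 route works on a given 64-bit head node. This assumes gcc plus the 32-bit glibc/multilib packages are installed; the sketch just prints a note if they are missing.

```shell
# Try building a trivial program as a 32-bit binary on an x86_64 host.
T=$(mktemp -d)
echo 'int main(void){return 0;}' > "$T/hello.c"
if gcc -m32 "$T/hello.c" -o "$T/hello32" 2>/dev/null; then
    # file(1) should report "ELF 32-bit" for the result
    RESULT="built a 32-bit binary: $(file -b "$T/hello32" 2>/dev/null)"
else
    RESULT="32-bit toolchain/libs not installed on this host"
fi
echo "$RESULT"
rm -rf "$T"
```

If the 32-bit runtime libraries are missing, installing the distro's multilib/compat packages (or building on a 32-bit node, as suggested above) are the usual fixes.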
Ian Kaufman
Research Systems Administrator
Jacobs School of Engineering
Also, is /share/apps/orca_amd64_exe/orca a script, or the compiler
output? If it's a script, somewhere in it is a line that specifies the
"nodes" file, which machines to contact to start the MPI job. (mpiexec
handles this transparently when launched via torque -- it gets the list
of nodes assigned to the job via an environment variable.) It looks
like it's trying to connect to a machine named mr9540. Yours are likely
named compute-0-0 and compute-0-1.
If /share/apps/orca_amd64_exe/orca is a script, share it with us. If it's really an
executable (binary) file, then read `man mpiexec` or `man mpirun`.
Bart
Check to make sure ORCA is set up to use ssh rather than rsh when it
talks to the nodes. Otherwise it could be a number
of other things that would require more details to be given (and also
perhaps someone familiar with the parallel
implementation of ORCA - is there a user list? That seems like a much
better venue for this question).
-Kirk
--- On Tue, 12/16/08, Kirk Peterson <kipe...@wsu.edu> wrote:
> From: Kirk Peterson <kipe...@wsu.edu>
> Subject: Re: [Rocks-Discuss] mpi communcation problem
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
PBS Job Id: 15
Job Name: testPCGAMESS
Exec host: compute-0-1/3+compute-0-1/2+compute-0-1/1+compute-0-1/0+compute-0-0/3+compute-0-0/2+compute-0-0/1+compute-0-0/0
An error has occurred processing your job, see below.
Post job file processing error; job 15 on host compute-0-1/3+compute-0-1/2+compute-0-1/1+compute-0-1/0+compute-0-0/3+compute-0-0/2+compute-0-0/1+compute-0-0/0
Unable to copy file /opt/torque/spool/15.OU to root@mr9540:/root/testPCGAMESS/testPCGAMESS.o15
>>> error from copy
Permission denied (publickey,gssapi-with-mic,password).
lost connection
>>> end error output
Output retained on that host in: /opt/torque/undelivered/15.mr9540.OU
Unable to copy file /opt/torque/spool/15.mr9540.ER to root@mr9540:/root/testPCGAMESS/testPCGAMESS.e15
>>> error from copy
Permission denied (publickey,gssapi-with-mic,password).
lost connection
>>> end error output
Output retained on that host in: /opt/torque/undelivered/15.mr9540.ER
--- On Tue, 12/16/08, Kirk Peterson <kipe...@wsu.edu> wrote:
> From: Kirk Peterson <kipe...@wsu.edu>
> Subject: Re: [Rocks-Discuss] mpi communcation problem
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
> Date: Tuesday, December 16, 2008, 10:57 AM
looks like it's the same problem as before. I think Bart pointed out
that the hostname of mr9540 is the smoking gun. You're evidently not
passing a valid hostfile to mpirun. If you run this in PBS/torque you
should pass the file that's referenced by the environment variable
$PBS_NODEFILE. You'll have to check what ORCA or GAMESS require.
-Kirk
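Kirk's point about $PBS_NODEFILE can be sketched as follows. The nodefile is simulated here with a throwaway file (Torque writes one line per allocated processor), and the application path is illustrative.

```shell
# Simulate the nodefile Torque would hand the job, then derive the
# process count from it instead of hard-coding -np.
PBS_NODEFILE=$(mktemp)
printf 'compute-0-0\ncompute-0-0\ncompute-0-1\ncompute-0-1\n' > "$PBS_NODEFILE"
NP=$(wc -l < "$PBS_NODEFILE")     # one line per allocated processor
echo "would run: mpirun -np $NP -machinefile $PBS_NODEFILE ./my_mpi_app"
rm -f "$PBS_NODEFILE"
```

Inside a real Torque job script, $PBS_NODEFILE is already set, so only the last mpirun line (without the echo) is needed.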
Bart
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu
[mailto:npaci-rocks-dis...@sdsc.edu] On
> Behalf Of Jonas Baltrusaitis
> Sent: Tuesday, December 16, 2008 12:30 PM
> To: Discussion of Rocks Clusters
> Subject: Re: [Rocks-Discuss] mpi communcation problem
>
It sounds like Torque is trying to use SSH to copy files between nodes, which
fails when it hits the password prompt. Usually you don't want Torque to do
this; you want it to just copy things into place with the understanding that
they will be available in the same place on the backend due to a NFS mount or
other shared filesystem. See this mailing list thread for more info:
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2008-June/031406.html
-Brandon
Jonas Baltrusaitis wrote:
> This is what is mailed to the root account from Torque. This is not ORCA anymore but PCGAMESS, a different software package. It looks like Torque cannot copy files properly (or should I post this in the Torque forum?).
--
Brandon Davidson
Systems Administrator
University of Oregon Neuroinformatics Center
(541) 346-2417 bran...@uoregon.edu
Key Fingerprint 1F08 A331 78DF 1EFE F645 8AE5 8FBE 4147 E351 E139
#!/bin/sh
#PBS -l nodes=2:ppn=4
mpirun -m $PBS_NODEFILE -np 8 /share/apps/pcg71c -r -f -p -i /root/testPCGAMESS/testPCGAMESS.inp -o /root/testPCGAMESS/testPCGAMESS.out -ex /share/apps/pcg71c/fastdiag.ex&pcgp2p.ex -t /share/apps/scratch/
[root torque/mom_priv]# ssh c0-0 cat /opt/torque/mom_priv/config
$pbsserver [frontend].local
$usecp [frontend.domain.com]:/home /home
The more important aspect is that you're trying to run this in /root,
and/or as root.
Bart
--- On Tue, 12/16/08, Gus Correa <g...@ldeo.columbia.edu> wrote:
*** Do Not Submit Jobs As Root ***
Create a new user, make sure that its home directory is available on the backend
nodes by running 'rocks sync user', and then log in as that user and submit your
jobs.
Besides all the myriad other reasons not to do things as root, its home
directory is not shared between nodes. This will cause Torque to attempt to use
SSH to stage files in and out, leading to the problem addressed in my previous
email.
-Brandon
Jonas Baltrusaitis wrote:
> But I am passing a valid hostfile... Below is my submission script
>
> #!/bin/sh
> #PBS -l nodes=2:ppn=4
> mpirun -m $PBS_NODEFILE -np 8 /share/apps/pcg71c -r -f -p -i /root/testPCGAMESS/testPCGAMESS.inp -o /root/testPCGAMESS/testPCGAMESS.out -ex /share/apps/pcg71c/fastdiag.ex&pcgp2p.ex -t /share/apps/scratch/
As Bart pointed out, you should run as a regular user.
Create one, do "rocks sync users",
login as this user, submit the job, as Bart recommended.
You may still have
/opt/torque/spool/15.OU
on one of the nodes.
Check with
cluster-fork 'ls /opt/torque/spool/'
and
cluster-fork 'ls /opt/torque/undelivered/'
It would help if you show your Torque/PBS scripts,
mpirun/mpiexec commands, etc.
We're not all computational chemists.
Are PCGAMESS and ORCA precompiled executables, or did you
compile them from the source code?
Gus Correa
Jim Kusznir wrote:
> The nodes in question are "Intel(R) Xeon(TM) CPU 3.06GHz" (dual-cpu
> with hyperthreading). My understanding is that these are 32-bit CPUs
> (but I'd love to be corrected!).
I run a system somewhat like this - I have a 64bit frontend, 16 8-way 64-bit
Xeons, a 32-bit login node, and 16 dual-core-HT 32-bit Xeons. I have one queue
for the 64-bit nodes, one for the 32-bit nodes, and one for all of them.
Rocks doesn't support this configuration out of the box, as it will only boot
nodes of the same architecture as the frontend - you'd have to pop in the Kernel
DVD whenever you wanted to rebuild the 32-bit backends. I have a patch for 5.0
that fixes this that I think will be pretty easy to port to 5.1, although I
haven't tried yet. It would probably also allow you to build a 32-bit Xen
virtual cluster on 64-bit nodes, which could be interesting.
-Brandon
OK, so now it's better. I created a user. I submit a PCGAMESS job with the following script and get the following error. At least I am getting something!..
#!/bin/sh
#PBS -l nodes=2:ppn=4
mpirun -H $PBS_NODEFILE -n 8 /share/apps/pcg71c -r -f -p -i /home/jbaltrus/testPCGAMESS/testPCGAMESS.inp -o /home/jbaltrus/testPCGAMESS/testPCGAMESS.out -ex /share/apps/pcg71c/fastdiag.ex&/share/apps/pcg71c/pcgp2p.ex -t /share/apps/scratch/
Error:
ssh: /opt/torque/aux//36.mr9540.healthcare.uiowa.edu: Name or service not known
/opt/torque/mom_priv/jobs/36.mr9540.healthcare.uiowa.edu.SC: line 3: /share/apps/pcg71c/pcgp2p.ex: cannot execute binary file
mpirun: killing job...
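One hedged observation on the script above: the unquoted & in the -ex argument is a shell control operator, so the shell ends the mpirun command there, backgrounds it, and then tries to run pcgp2p.ex as a command of its own - which would match the "cannot execute binary file" line. A tiny demonstration (filenames illustrative):

```shell
# An unquoted & ends the command and backgrounds it; whatever follows
# runs as a *separate* command.
SPLIT=$(sh -c 'echo one&echo two' | sort | paste -sd+ -)  # two commands ran
# Quoting keeps the whole thing one string/argument:
WHOLE=$(printf '%s' 'fastdiag.ex&pcgp2p.ex')
echo "unquoted ran as: $SPLIT   quoted stays: $WHOLE"
```

So if multiple -ex files really need to be passed as one argument, the argument should be quoted; check the PCGAMESS docs for the exact syntax it expects.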
# ls -lF /share/apps/pcg71c/pcgp2p.ex
# ssh c0-0 /share/apps/pcg71c/pcgp2p.ex
It says it can't execute (run) the file...
I'm not sure what the "ssh: " line is all about.
Bart
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu
[mailto:npaci-rocks-dis...@sdsc.edu] On
> Behalf Of Jonas Baltrusaitis
[root@mr9540 ~]# ls -lF /share/apps/pcg71c/pcgp2p.ex
-rwxr-xr-x 1 root root 35328 Dec 15 19:15 /share/apps/pcg71c/pcgp2p.ex*
[root@mr9540 ~]# ssh c0-0 /share/apps/pcg71c/pcgp2p.ex
ssh: c0-0: Name or service not known
[root@mr9540 ~]# ssh compute-0-0 /share/apps/pcg71c/pcgp2p.ex
bash: /share/apps/pcg71c/pcgp2p.ex: cannot execute binary file
[root@mr9540 ~]#
--- On Tue, 12/16/08, Bart Brashers <bbra...@environcorp.com> wrote:
Some wild guesses.
I saw on the PCGAMESS site that they have over 64 pre-compiled versions!
Which one did you install?
I would try the version with MPICH static libraries and ssh first.
Perhaps optimized for your architecture (but maybe just the generic
"Pentium").
You certainly need an ssh version (Rocks uses ssh not rsh).
Also, I haven't seen any 64-bit version of PCGAMESS (but I didn't search
much).
Not sure if your system is 64-bit.
In any case, I presume PCGAMESS is linked to MPICH-1, right?
In this case, you need to launch it with the mpirun that comes with MPICH-1.
Most likely this is not the first mpirun on your path.
What is the output of "which mpirun"?
On my Rocks 4.3 this is the OpenMPI mpirun, not the MPICH-1 mpirun.
However, I have the MPICH-1 mpirun on:
/opt/mpich/gnu/bin/mpirun
You need to find the correct one on your system
(the "locate" command is your friend), and use the full path name
on your PBS script.
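Gus's point about which mpirun wins on the PATH can be seen with a self-contained demo. The two fake mpirun wrappers below are illustrative stand-ins for the OpenMPI and MPICH-1 launchers, not the real programs:

```shell
# Two directories each provide an "mpirun"; the directory prepended
# to PATH last shadows the other.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/openmpi/bin" "$DEMO/mpich/bin"
printf '#!/bin/sh\necho openmpi\n' > "$DEMO/openmpi/bin/mpirun"
printf '#!/bin/sh\necho mpich\n'   > "$DEMO/mpich/bin/mpirun"
chmod +x "$DEMO/openmpi/bin/mpirun" "$DEMO/mpich/bin/mpirun"
PATH="$DEMO/openmpi/bin:$PATH"
PATH="$DEMO/mpich/bin:$PATH"       # prepended last, so it shadows the other
WHICH_MPIRUN=$(command -v mpirun)  # resolves inside .../mpich/bin
FLAVOR=$(mpirun)
echo "$WHICH_MPIRUN resolves to the $FLAVOR mpirun"
rm -rf "$DEMO"
```

This is why using the full path (/opt/mpich/gnu/bin/mpirun) in the PBS script sidesteps the whole question.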
I hope this helps,
# cat /opt/torque/server_priv/nodes
Bart
# ssh c0-0 ls -lF /share/apps/pcg71c/pcgp2p.ex
Which MPI are you using? Unless you are using a Torque-aware MPI, it will still
want to use SSH to launch your jobs. Both OpenMPI and LAM can be built to use
Torque, but neither Rocks nor the Torque roll provides one... so you'll have to
make sure that you have SSH keys set up in advance.
Additionally, it sounds like your mpirun is expecting the -H argument to be a
list of hosts, not a file containing the list.
You might try running a simple MPI hello-world app before moving up to an actual
application, just to make sure that the basics are all in place.
-Brandon
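If the -H form is kept, the nodefile has to be flattened into a comma-separated host list first. A hedged sketch with a simulated nodefile (host names illustrative):

```shell
# $PBS_NODEFILE has one host per allocated processor; -H wants a
# comma-separated list of hosts instead.
PBS_NODEFILE=$(mktemp)
printf 'compute-0-0\ncompute-0-0\ncompute-0-1\n' > "$PBS_NODEFILE"
HOSTS=$(sort -u "$PBS_NODEFILE" | paste -sd, -)
echo "would run: mpirun -H $HOSTS -n 8 ..."
rm -f "$PBS_NODEFILE"
```

Note that sort -u collapses duplicates, so per-host slot counts are lost; -machinefile with the file itself is usually the simpler route.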
Bart
looks alright to me
--- On Tue, 12/16/08, Bart Brashers <bbra...@environcorp.com> wrote:
Linux MPICH (using ssh as remote shell by default), fully statically linked Serial/parallel Linux binaries linked with MPICH, optimized for Pentium 4, Pentium D, Xeon, Intel Core 2 (Conroe/Merom/Woodcrest/Clovertown etc..., Penryn/Harpertown etc...), Intel Core i7 (Nehalem etc..) processors, as well as for AMD Phenom (tri- and four-core)/AMD Barcelona (four-core Opterons) processors.
# ssh compute-0-0 ls -lF /share/apps/pcg71c/pcgp2p.ex
I just want to make sure the auto-mounter is working correctly.
I don't think so. I'm pretty sure you want just the name of the
frontend, without the FQDN, like so:
mr9540 np=8
compute-0-0 np=4
compute-0-1 np=4
You can certainly test it easily, using a simple script:
# cat testme.csh
#!/bin/csh -f
echo $hostname
sleep 60
# qsub -l nodes=mr9540 testme.csh
And see if it works.
Bart
Do you want me to revert to mr9540? Let me see if you taught me right:
change it in /opt/torque/server_priv/nodes and re-start pbs_server?
--- On Tue, 12/16/08, Bart Brashers <bbra...@environcorp.com> wrote:
> From: Bart Brashers <bbra...@environcorp.com>
> Subject: Re: [Rocks-Discuss] mpi communcation problem
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
I will try to use compute-0-0 instead of $PBS_NODEFILE and see if I can get any of the programs going. BTW, where is the $PBS_NODEFILE and how can I change it so it omits the frontend for now?
--- On Tue, 12/16/08, Bart Brashers <bbra...@environcorp.com> wrote:
> From: Bart Brashers <bbra...@environcorp.com>
> Subject: Re: [Rocks-Discuss] mpi communcation problem
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
> Date: Tuesday, December 16, 2008, 5:09 PM
Do a cat /proc/cpuinfo and see if it has the lm flag, which stands for long mode.
Or boot a 64-bit live CD at Christmas, when you can stop the node... ;-)
But you also have to be sure, eventually, that in that case the
motherboard supports 64-bit too.
I remember a post in another mailing list where an Intel(R)
Xeon(TM) CPU 3.00GHz indeed had both the lm and ht (hyperthreading) flags set, but
the mobo used was not 64-bit capable...
The actual 64-bit performance of these CPUs should also be verified.
In general, the definitions of the flags are in
include/asm/cpufeature.h of the kernel tree.
HTH,
Gianluca
$PBS_NODEFILE is generated on the fly by Torque/PBS,
one file for each job, based on the /opt/torque/server_priv/nodes,
and on the availability of resources.
If you want to modify something and remove the frontend for now,
the right file to edit is /opt/torque/server_priv/nodes.
You must restart the pbs_server (on the frontend) after you edit the file,
for the change to take effect (service pbs_server stop; service
pbs_server start).
Bart and Brandon recommended that, and I second their suggestion,
at least until you get the basic functionality to work.
Also, try Brandon's suggestion to run a basic MPI program,
to check if the nodes can talk to each other.
If you want to stick to MPICH-1, which seems to be what PCGAMESS uses,
there are very simple example programs in /opt/mpich/gnu/examples/
(at least on my Rocks 4.3). Try cpi.c, maybe the hello++.cc also.
To avoid confusion with your PATH,
use full path names to compile /opt/mpich/gnu/bin/mpicc, etc,
and for the mpirun (/opt/mpich/gnu/bin/mpirun)
launcher on your PBS script too.
This will help to check if you have the basic functionality of
Torque and MPICH working.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
Yes, the 32-bit (i686) executable is likely to work on a 64-bit (x86_64)
processor,
AMD or Intel. It will run in 32-bit mode.
Since you've got the version optimized for Pentium 4 and above,
you need to make sure it matches your processors.
Most likely yes, since they are 64-bit, which came after Pentium 4.
Which processors do you have?
Try my suggestions below, for when you go back to PCGAMESS.
Jonas Baltrusaitis wrote:
> If I do # qsub -l nodes=mr9540 testme.csh it gives me error qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes
>
Still running as root, as the # prompt above suggests?
The rootly super power doesn't help here.
> even after removing the FQDN from nodelist.
The FQDN associates to the frontend's eth1,
your outside-looking Internet interface.
The cluster-local frontend name (mr9540, I presume) associates to the
frontend's eth0,
which is the inside-looking interface,
part of the 10.0.0.0 private subnet (unless you changed it)
that all compute nodes belong to.
The latter should be used by Torque and by MPI,
as the traffic of both flows across the private subnet.
Did you restart the pbs_server, after editing the nodes file?
> If I do # qsub -l nodes=compute-0-0 testme.csh it successfully runs through the queue and finishes with two empty files. I assume that's a success
>
There seems to be a minor glitch.
On "testme.csh" try
echo $HOSTNAME
instead of
echo $hostname.
The environment variable (uppercase) is defined naturally by your login
shell.
Bart's script is meant to write the hostname(s) to the PBS script
"stdout" file.
Your "finishes with two empty files", if you are referring to stdout and
stderr,
is not necessarily success.
Again, if you are running as root, you are asking for trouble.
Submit jobs as a regular user.
> I will try to use compute-0-0 instead of $PBS_NODEFILE and see if I can get any of the programs going. BTW, where is the $PBS_NODEFILE and how can I change it so it omits the frontend for now?
>
>
Make it easy on yourself.
Remove the frontend from the node file for now, restart the pbs_server,
as Bart and Brandon suggested.
You can add it later, after you sort this out.
Good luck!
Gus Correa
# set so no ssh problems with pcgamess
export P4_RSHCOMMAND=ssh
As suggested on the PCGamess/ Firefly web page?
I had your problem when I 1st started. I placed:
# set so no ssh problems with pcgamess
export P4_RSHCOMMAND=ssh
In my /etc/profile file and now the problem is resolved.
Jim
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu
> [mailto:npaci-rocks-dis...@sdsc.edu] On Behalf Of
> Jonas Baltrusaitis
It doesn't appear that you have set up your ssh key.
This process will make the files:
/root/.ssh/id_rsa.pub
/root/.ssh/id_rsa
/root/.ssh/authorized_keys
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa)
What do I enter for file name and, later on, for passwords?
Ian Kaufman
Research Systems Administrator
Jacobs School of Engineering
ikau...@soe.ucsd.edu x49716
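For reference, the key generation described above can be done non-interactively. A hedged sketch, run here in a scratch directory so it is safe to replay; for real use the files would live in ~/.ssh as listed earlier, and the passphrase prompts can simply be left empty for passwordless logins:

```shell
# Generate a key pair with no passphrase (-N ''), then authorize it.
D=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$D/id_rsa"
cat "$D/id_rsa.pub" >> "$D/authorized_keys"
chmod 600 "$D/authorized_keys"
FILES=$(ls "$D")
echo "$FILES"
rm -rf "$D"
```

For the file-name prompt, pressing Enter accepts the default (/root/.ssh/id_rsa, or ~/.ssh/id_rsa for a regular user).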
> -----Original Message-----
> From: npaci-rocks-dis...@sdsc.edu [mailto:npaci-rocks-
> discussio...@sdsc.edu] On Behalf Of Jonas Baltrusaitis
> Sent: Monday, December 22, 2008 12:00 PM
> To: Discussion of Rocks Clusters
$ cp /opt/mpi-tests/src/*.c .
$ cp /opt/mpi-tests/src/Makefile .
$ make
[jbaltrus@mr9540 test]$ cp /opt/mpi-tests/src/*.c
cp: cannot create regular file `/opt/mpi-tests/src/mpi-verify.c': Permission denied
[jbaltrus@mr9540 test]$
Following that I create file /test/mpi-ring.qsub with
#!/bin/bash
#
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
/opt/openmpi/bin/mpirun -np $NSLOTS $HOME/test/mpi-ring
and when I submit it with [jbaltrus@mr9540 test]$ qsub -pe orte 4 mpi-ring.qsub
I get
Warning: Permanently added 'compute-0-0.local' (RSA) to the list of known hosts.
[compute-0-0.local:16126] [0,0,1] ORTE_ERROR_LOG: Not found in file odls_default_module.c at line 1191
--------------------------------------------------------------------------
Failed to find or execute the following executable:
Host: compute-0-0.local
Executable: /home/jbaltrus/test/mpi-ring
Cannot continue.
--------------------------------------------------------------------------
[compute-0-0.local:16126] [0,0,1] ORTE_ERROR_LOG: Not found in file orted.c at line 626
Could anybody tell me what's wrong with SGE/permissions?
thanks
Jonas
Jonas
#!/bin/bash
#$ -N xxxr
#$ -S /bin/bash
#MPI is also available. Simply substitute "mpi" for "mpich"
#$ -pe mpich 4
#$ -cwd
#$ -o xxxr.out
#$ -e xxxr.err
#$ -notify
/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /share/apps/orca_amd64_exe/orca
Incidentally, how do I add the frontend so I can run jobs on it using SGE?
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
53 0.55500 RhC20As4H3 jbaltrus qw 12/27/2008 16:59:10 4
[jbaltrus@mr9540 RhC20As4H32Cl_CO2]$
The other question remains: how do I set up SGE so it submits jobs not only to the compute nodes, but also to the frontend (which has as many processors as the compute nodes do)?
--- On Sat, 12/27/08, Jonas Baltrusaitis <jasi...@yahoo.com> wrote:
--
Hung-Sheng Tsao, Ph.D. (LaoTsao) Sr. System Engineer
US, GEH East TS Ambassador
400 Atrium Dr, 1ST FLOOR P/F:1877 319 0460 (x67079)
Somerset, NJ 08873 C: 973 495 0840
http://blogs.sun.com/hstsao/ E:Hung-Sh...@sun.com
[root@mr9540 ~]# /opt/gridengine/install_execd
/opt/gridengine/install_execd: line 34: /root/inst_sge: No such file or directory
/opt/gridengine/install_execd: line 34: exec: /root/inst_sge: cannot execute: No such file or directory
[root@mr9540 ~]# qmon
Error: Can't open display:
[root@mr9540 ~]# qconf
--- On Sun, 12/28/08, Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com> wrote:
cd /opt/gridengine
./install_execd
answer all the defaults.
You may want to reduce the slots for the frontend so you have CPU resources for
other Rocks usage.
qconf -mattr queue slots '[<frontend>=4]' all.q
<frontend> is the output of frontend name in
qconf -sq all.q |grep slots
hth
[jbaltrus@mr9540 RhC20As4H32Cl_CO2]$ qconf -sq all.q |grep slots
slots 1,[compute-0-0.local=4],[compute-0-1.local=4], \
[jbaltrus@mr9540 RhC20As4H32Cl_CO2]$
[root@mr9540 ~]# cd /var/tmp
[root@mr9540 tmp]# qconf -sq all.q >all.q.out
[root@mr9540 tmp]# grep mr9540 all.q.out
[mr9540.local=1]
[root@mr9540 tmp]# qconf -sq all.q |grep slots
slots 1,[compute-0-0.local=4],[compute-0-1.local=4], \
just do
qconf -mattr queue slots '[mr9540.local=4]' all.q
to assign 4 slots for mr9540.local
verify it
qconf -sq all.q |grep mr9540
thanks
Jonas
--
top - 17:11:26 up 6 days, 3:29, 3 users, load average: 6.42, 6.19, 6.06
Tasks: 159 total, 6 running, 153 sleeping, 0 stopped, 0 zombie
Cpu(s): 26.7%us, 71.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 2.0%si, 0.0%st
Mem: 8124416k total, 5889708k used, 2234708k free, 352236k buffers
Swap: 4923868k total, 0k used, 4923868k free, 4331840k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7535 jbaltrus 25 0 894m 209m 18m S 33.3 2.6 48:50.51 pcgamess
7536 jbaltrus 25 0 894m 198m 18m S 33.3 2.5 48:50.47 pcgamess
7537 jbaltrus 25 0 893m 198m 18m R 33.3 2.5 48:50.44 pcgamess
1 root 15 0 10324 692 580 S 0.0 0.0 0:00.20 init
2 root RT -5 0 0 0 S 0.0 0.0 0:00.01 migration/0
3 root 34 19 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/0
4 root RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
26 root 10 -5 0 0 0 S 0.0 0.0 0:00.15 events/0
34 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
35 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 kthread
37 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenwatch
38 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 xenbus
47 root 10 -5 0 0 0 S 0.0 0.0 0:00.02 kblockd/0
55 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kacpid
196 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue/0
207 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 khubd
209 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
310 root 25 0 0 0 0 S 0.0 0.0 0:00.00 pdflush
311 root 15 0 0 0 0 S 0.0 0.0 0:01.06 pdflush
312 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 kswapd0
313 root 20 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
463 root 11 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
542 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 ata/0
550 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 ata_aux
560 root 13 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_0
561 root 14 -5 0 0 0 S 0.0 0.0 0:00.00 scsi_eh_1
562 root 11 -5 0 0 0 S 0.0 0.0 0:00.01 scsi_eh_2
563 root 11 -5 0 0 0 S 0.0 0.0 0:00.01 scsi_eh_3
564 root 10 -5 0 0 0 S 0.0 0.0 0:00.24 kjournald
591 root 10 -5 0 0 0 S 0.0 0.0 0:00.00 kauditd
623 root 21 -4 12636 820 372 S 0.0 0.0 0:00.33 udevd
1970 root 17 0 3760 380 316 S 0.0 0.0 0:00.00 change_console
2131 root 16 -5 0 0 0 S 0.0 0.0 0:00.00 kmpathd/0
2175 root 10 -5 0 0 0 S 0.0 0.0 0:01.50 kjournald
2177 root 10 -5 0 0 0 S 0.0 0.0 0:00.36 kjournald
2825 root 12 -3 18356 712 512 S 0.0 0.0 0:00.01 auditd
2827 root 12 -3 16228 740 596 S 0.0 0.0 0:00.01 audispd
3009 root 18 0 170m 8976 2380 S 0.0 0.1 4:22.22 greceptor
3020 root 15 0 10088 772 596 S 0.0 0.0 0:00.30 syslogd
3023 root 15 0 3784 424 340 S 0.0 0.0 0:00.00 klogd
3036 root 18 0 10708 392 248 S 0.0 0.0 0:00.41 irqbalance
3085 rpc 15 0 8028 652 516 S 0.0 0.0 0:00.01 portmap
3100 sge 25 0 231m 4140 2684 S 0.0 0.1 0:08.44 sge_qmaster
3106 root 18 0 10120 768 636 S 0.0 0.0 0:00.00 rpc.statd
3120 sge 15 0 171m 2372 1760 S 0.0 0.0 0:00.90 sge_schedd
3149 root 15 0 48648 700 272 S 0.0 0.0 0:00.02 rpc.idmapd
3175 nobody 18 0 105m 1412 852 S 0.0 0.0 6:48.01 gmetad
3195 dbus 15 0 21376 876 556 S 0.0 0.0 0:00.84 dbus-daemon
3248 root 21 0 39892 1616 1204 S 0.0 0.0 0:00.05 automount
3267 root 18 0 3780 564 460 S 0.0 0.0 0:00.00 acpid
3367 root 15 0 142m 6416 3004 S 0.0 0.1 0:03.86 snmpd
3382 root 15 0 60528 1220 672 S 0.0 0.0 0:00.06 sshd
I am not sure I understood what you said.
How did you launch PCGAMESS?
What is the mpirun command?
How many processes did you launch (2, 3, 4 or 8?),
and what is in your machines file (or equivalent from Torque or SGE)?
Where does this "top" output come from? (frontend, or which compute node?)
Somehow the program seems to run on a single CPU,
with three processes sharing it at 33% each, as shown in the %CPU column
of top.
While on top, type "1" (number one), to show the activity on each CPU/core.
Gus Correa
This is my submit line
/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /share/apps/pcg71c/pcgamess -i /home/jbaltrus/RhC20As4H32Cl_CO2/RhC20As4H32Cl_CO2.inp -ex /share/apps/pcg71c -t /share/apps/scratch/tmp
Since I am confused I will appreciate any help
Jonas
PS: why would the frontend CPUs be involved in pcgamess jobs anyway?.. Something is not right. I am running SGE. How do I check the machines file?
--- On Mon, 12/29/08, Gus Correa <g...@ldeo.columbia.edu> wrote:
> From: Gus Correa <g...@ldeo.columbia.edu>
> Subject: Re: [Rocks-Discuss] frontend slow
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
qstat -f should show you what's running on each execd host.
The load average is 6.5 for an 8-CPU system, with some 2.2GB of memory free.
Try to reduce the frontend slots to 1,
qdel <JID> to delete the job,
and submit again through qsub.
All jobs need to be submitted via qsub.
--
queuename qtype used/tot. load_avg arch states
----------------------------------------------------------------------------
al...@compute-0-0.local BIP 4/4 4.07 lx26-amd64
77 0.55500 RhC20As4H3 jbaltrus r 12/29/2008 16:21:11 4
----------------------------------------------------------------------------
al...@compute-0-1.local BIP 4/4 4.05 lx26-amd64
74 0.55500 Co2crypt2_ jbaltrus r 12/29/2008 15:34:26 4
----------------------------------------------------------------------------
al...@mr9540.local BIP 0/4 6.18 lx26-amd64 a
[jbaltrus@mr9540 ~]$
--- On Mon, 12/29/08, Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com> wrote:
> From: Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com>
> Subject: Re: [Rocks-Discuss] frontend slow
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
An SGE user or expert may help you better than me.
I use Torque/PBS instead.
However, from what you said,
it is clear that SGE is directing your jobs, or part of them,
to the frontend.
You may want to show your full SGE scripts for each of the two jobs.
With the info you sent,
I don't know what the value of $NSLOTS is, for instance.
On Torque/PBS the "machines" file where the job will run is generated
on the fly, depending on available resources, but can be accessed
through the environment variable $PBS_NODEFILE
from within the Torque/PBS script.
Most likely SGE has a similar mechanism.
Moreover, there is a configuration file that tells Torque/PBS how many
nodes you have in the cluster, and how many CPUs you have on each.
SGE should have a similar feature.
This is the master node file that you need to look for and get right.
BTW, did you type "1" (number one) while on top?
What is the output?
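(A small sketch along these lines, assuming SGE's mpich parallel environment as in this thread: the $NSLOTS variable and the $TMPDIR/machines file are only defined inside a running SGE job, so a snippet like this, dropped into the job script, would show where the job actually landed. Outside SGE it just reports them as unset/missing.)

```shell
# Hedged sketch: dump what SGE's mpich PE gives a job.
# $NSLOTS and $TMPDIR/machines exist only inside a running SGE job;
# run outside SGE, this reports them as unset/missing.
echo "NSLOTS=${NSLOTS:-unset}"
if [ -f "${TMPDIR:-/nonexistent}/machines" ]; then
    cat "$TMPDIR/machines"
else
    echo "machines file not found (not inside an SGE mpich job?)"
fi
```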
#!/bin/bash
#$ -N RhC20As4H32Cl_CO2
#$ -S /bin/bash
#MPI is also available. Simply substitute "mpi" for "mpich"
#$ -pe mpich 4
#$ -cwd
#$ -o RhC20As4H32Cl_CO2.out
#$ -e RhC20As4H32Cl_CO2.err
#$ -notify
/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines /share/apps/pcg71c/pcgamess -i /home/jbaltrus/RhC20As4H32Cl_CO2/RhC20As4H32Cl_CO2.inp -ex /share/apps/pcg71c -t /share/apps/scratch/tmp
--- On Mon, 12/29/08, Gus Correa <g...@ldeo.columbia.edu> wrote:
Maybe do "pkill pcgamess" on the frontend to kill them.
What is the whole script that you used to submit the jobs?
Any chances that you may have inadvertently launched a
PCGAMESS run on the frontend using mpirun directly, but not SGE?
If this is the case, I guess SGE won't know about that run,
and won't report it as a job in qstat (as you show below).
What is the output of
ps -u jbaltrus
on the frontend?
What is the output of
cluster-fork ' ps -u jbaltrus' ?
Gus Correa
top - 18:56:44 up 6 days, 5:14, 3 users, load average: 6.55, 6.22, 6.08
Tasks: 159 total, 7 running, 152 sleeping, 0 stopped, 0 zombie
Cpu0 : 27.3%us, 70.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.1%hi, 2.2%si, 0.0%st
Mem: 8124416k total, 5935216k used, 2189200k free, 355820k buffers
Swap: 4923868k total, 0k used, 4923868k free, 4372760k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7537 jbaltrus 25 0 893m 198m 18m S 33.3 2.5 83:45.53 pcgamess
7535 jbaltrus 25 0 894m 209m 18m R 33.0 2.6 83:45.91 pcgamess
7536 jbaltrus 25 0 894m 198m 18m R 33.0 2.5 83:45.89 pcgamess
2175 root 10 -5 0 0 0 S 0.0 0.0 0:01.69 kjournald
2177 root 10 -5 0 0 0 S 0.0 0.0 0:00.36 kjournald
2825 root 12 -3 18356 712 512 S 0.0 0.0 0:00.01 auditd
2827 root 12 -3 16228 740 596 S 0.0 0.0 0:00.01 audispd
3009 root 18 0 170m 8976 2380 S 0.0 0.1 4:25.54 greceptor
3020 root 15 0 10088 772 596 S 0.0 0.0 0:00.30 syslogd
3023 root 15 0 3784 424 340 S 0.0 0.0 0:00.00 klogd
3036 root 18 0 10708 392 248 S 0.0 0.0 0:00.41 irqbalance
3085 rpc 15 0 8028 652 516 S 0.0 0.0 0:00.01 portmap
3100 sge 25 0 231m 4140 2684 S 0.0 0.1 0:08.65 sge_qmaster
3106 root 18 0 10120 768 636 S 0.0 0.0 0:00.00 rpc.statd
3120 sge 15 0 171m 2372 1760 S 0.0 0.0 0:00.90 sge_schedd
3149 root 15 0 48648 700 272 S 0.0 0.0 0:00.02 rpc.idmapd
3175 nobody 18 0 105m 1412 852 S 0.0 0.0 6:52.47 gmetad
3195 dbus 15 0 21376 876 556 S 0.0 0.0 0:00.84 dbus-daemon
3248 root 21 0 39892 1616 1204 S 0.0 0.0 0:00.05 automount
3267 root 18 0 3780 564 460 S 0.0 0.0 0:00.00 acpid
3367 root 15 0 142m 6416 3004 S 0.0 0.1 0:03.86 snmpd
3382 root 15 0 60528 1220 672 S 0.0 0.0 0:00.06 sshd
--- On Mon, 12/29/08, Gus Correa <g...@ldeo.columbia.edu> wrote:
> From: Gus Correa <g...@ldeo.columbia.edu>
> Subject: Re: [Rocks-Discuss] frontend slow
> To: "Discussion of Rocks Clusters" <npaci-rocks...@sdsc.edu>
ps -u
[jbaltrus@mr9540 ~]$ ps -u jbaltrus
PID TTY TIME CMD
7018 ? 00:11:26 pcgamess
7180 ? 00:00:00 pcgamess
7184 ? 00:09:35 pcgamess
7353 ? 00:00:00 pcgamess
7355 ? 00:23:22 pcgamess
7517 ? 00:00:00 pcgamess
7518 ? 00:00:00 pcgamess
7520 ? 00:00:00 pcgamess
7521 ? 00:00:00 pcgamess
7535 ? 01:24:06 pcgamess
7536 ? 01:24:06 pcgamess
7537 ? 01:24:05 pcgamess
11323 ? 00:00:00 sshd
11324 ? 00:00:00 sftp-server
11377 pts/3 00:00:00 bash
11659 pts/3 00:00:00 ps
[jbaltrus@mr9540 ~]$
cluster-fork
[jbaltrus@mr9540 ~]$ cluster-fork ' ps -u jbaltrus'
compute-0-0:
PID TTY TIME CMD
3784 ? 00:00:00 bash
3785 ? 00:00:00 mpirun
3925 ? 02:35:16 pcgamess
3926 ? 00:00:00 pcgamess
3927 ? 00:00:00 ssh
3930 ? 00:00:00 sshd
3931 ? 02:36:53 pcgamess
4084 ? 00:00:00 ssh
4085 ? 00:00:00 pcgamess
4088 ? 00:00:00 sshd
4089 ? 02:36:50 pcgamess
4242 ? 00:00:00 ssh
4243 ? 00:00:00 pcgamess
4246 ? 00:00:00 sshd
4247 ? 02:36:08 pcgamess
4400 ? 00:00:00 pcgamess
4401 ? 00:00:00 pcgamess
4402 ? 00:00:00 pcgamess
4403 ? 00:00:00 pcgamess
4404 ? 00:00:00 pcgamess
4429 ? 00:00:09 pcgamess
4430 ? 00:00:47 pcgamess
4431 ? 00:00:09 pcgamess
4432 ? 00:00:09 pcgamess
4533 ? 00:00:00 sshd
4534 ? 00:00:00 ps
compute-0-1:
PID TTY TIME CMD
3169 ? 00:00:00 bash
3170 ? 00:00:00 mpirun
3313 ? 03:21:32 pcgamess
3314 ? 00:00:00 pcgamess
3315 ? 00:00:00 ssh
3318 ? 00:00:00 sshd
3319 ? 03:23:39 pcgamess
3472 ? 00:00:00 pcgamess
3473 ? 00:00:00 ssh
3476 ? 00:00:00 sshd
3477 ? 03:23:41 pcgamess
3630 ? 00:00:00 ssh
3631 ? 00:00:00 pcgamess
3634 ? 00:00:00 sshd
3635 ? 03:22:23 pcgamess
3788 ? 00:00:00 pcgamess
3789 ? 00:00:00 pcgamess
3790 ? 00:00:00 pcgamess
3791 ? 00:00:00 pcgamess
3792 ? 00:00:00 pcgamess
3814 ? 00:00:51 pcgamess
3815 ? 00:00:10 pcgamess
3816 ? 00:00:11 pcgamess
3817 ? 00:00:10 pcgamess
3944 ? 00:00:00 sshd
3945 ? 00:00:00 ps
[jbaltrus@mr9540 ~]$
--- On Mon, 12/29/08, Gus Correa <g...@ldeo.columbia.edu> wrote:
--
You clearly have a large number of pcgamess processes on the frontend
and on the compute nodes.
Most likely a number of them are just leftovers of previous runs,
which were not killed properly.
12 on the frontend, 16 on compute-0-0, another 16 on compute-0-1.
Note how different the PID numbers and the process TIMEs are!
You may be able to kill some using SGE (qdel or equivalent).
However, after you do this, you need to check for what was
left, and kill them by hand, using kill -9 $PID (whatever the PID is),
both on the frontend and the nodes.
An alternative is to reboot all machines.
Then, start fresh! :)
The clean way to submit jobs is to use only your resource manager
(SGE qsub in your case), not mpirun directly.
The clean way to kill jobs is to use only your resource manager
(SGE qdel in your case), not kill -9 or the like.
I hope this helps,
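(A sketch of the "check for what was left" step above, for the record: pgrep/pkill can find and remove the leftover pcgamess processes per user. The user name here is taken from this thread; run it on the frontend and, via cluster-fork, on each node.)

```shell
# Hedged sketch: after qdel, check for leftover pcgamess processes
# that the resource manager did not kill.
# USER_TO_CHECK defaults to the current user; substitute the actual
# job owner (jbaltrus in this thread).
USER_TO_CHECK=${USER_TO_CHECK:-$(id -un)}
pgrep -u "$USER_TO_CHECK" pcgamess && echo "leftovers found" \
    || echo "no leftover pcgamess processes"
# To kill any leftovers by hand (use with care):
# pkill -9 -u "$USER_TO_CHECK" pcgamess
```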
Jonas Baltrusaitis wrote:
> Thank you both so much. However, I only use SGE to submit and/or kill jobs, with the below-mentioned script. Where did the leftovers come from?
>
>
Hard to tell.
Did you restart the SGE daemon along the way, by any means?
I don't know anything about SGE, I don't use it.
Dr. Tsao may have a better guess.
However, the fix is to start fresh.
It may be easier to
just reboot all machines (clean, with shutdown -r), and give it another try.
Gus Correa
Here is another line of thought.
Jonas Baltrusaitis wrote:
> whne I type 1
>
> top - 18:56:44 up 6 days, 5:14, 3 users, load average: 6.55, 6.22, 6.08
> Tasks: 159 total, 7 running, 152 sleeping, 0 stopped, 0 zombie
> Cpu0 : 27.3%us, 70.4%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.1%hi, 2.2%si, 0.0%st
>
That is weird.
If you have 8 cores on the frontend, after you type "1",
top should have shown 8 lines, one for each core/CPU:
Cpu0 ...
Cpu1 ...
...
Cpu7 ...
However, it only shows one, for Cpu0.
What is the output of "uname -a" on the frontend?
What is the output of "cat /proc/cpuinfo" on the frontend?
Gus Correa
My thinking is that you may have used -pe mpi before;
qdel would not delete all the MPI jobs because of the loose integration.
Jonas Baltrusaitis wrote:
> I'll reiterate: I only use SGE to submit and kill the jobs. Any idea where those others came from?
>
>
I'll reiterate: we don't know. :)
Bear with us, let's try to find out what is going wrong.
Gus Correa
[jbaltrus@mr9540 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
[jbaltrus@mr9540 ~]$
Hmmm ....
Are you sure the frontend is a dual-socket quad-core (total 8-cores)
machine?
It seems to have an SMP kernel,
but cpuinfo only reports one processor with a single core.
I expected to see 8 sections on this report, one for each of
processor0, ...., processor7 (i.e. the 8 cores).
Also, I am not familiar with the latest kernel naming convention,
but yours is called 2.6.18-92.1.13.el5xen.
Did you install the xen roll?
I never used it, but I wonder if it would hide the actual cores,
to play with virtualization.
What is the output of "uname -a" and of "cat /proc/cpuinfo" on compute-0-0?
(Just to compare with the frontend results.)
From what I am reading on the web, you need to check the inp file you used;
there are some parameters that tell pcgamess to use more than one CPU to run.
Please share your inp file.
$contrl scftyp=UHF runtyp=optimize icharg=4 mult=11 maxit=300 ECP=read $end
$contrl dfttyp=b3lyp nzvar=0 $end
$system timlim=999999999 MWORDS=100 kdiag=0 $end
!$basis gbasis=N31 ngauss=6 ndfunc=1 $end
$SCF DIRSCF=.T. FDIFF=.t. NPUNCH=0 $END
$ZMAT DLC=.T. AUTO=.T. $END
$STATPT NSTEP=200 OPTTOL=0.0005 NPRT=-2 NPUN=-2 HSSEND=.t. $END
$p2p p2p=.t. dlb=.t. $end
$GUESS guess=huckel kdiag=0 $END
$CONTRL COORD=UNIQUE $END
$DATA
Co2cryptate2 + CO2 b3lyp/ lanl2dz on Co and 6-31G* on others
C1
COBALT 27.0 0.0440040000 0.0268240000 -0.0453410000
The compute nodes show 4-core processors, as normal:
[jbaltrus@compute-0-0 ~]$ uname -a
Linux compute-0-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
[jbaltrus@compute-0-0 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
stepping : 11
cpu MHz : 2667.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5323.69
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
stepping : 11
cpu MHz : 2667.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5319.88
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
stepping : 11
cpu MHz : 2667.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5320.00
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
stepping : 11
cpu MHz : 2667.000
cache size : 4096 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips : 5319.85
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
Thanks,
Jonas
Do "man qsub" to see the environment variables available to you.
You can put some statements in your script, e.g.
echo JOB_ID=$JOB_ID
echo JOB_HOME=$JOB_HOME
etc., to find out the real value for each run.
1) First and foremost, by all means, follow Dr. Tsao's advice on how to
set up PCGAMESS to run under SGE.
In particular, avoid using the master/frontend for now, as he suggested.
2) It is clear that you have different kernels
on the frontend (2.6.18-92.1.13.el5xen #1 SMP)
and on the compute nodes (2.6.18-92.1.13.el5 #1 SMP).
The former has xen, the latter doesn't.
3) Since the compute nodes report the correct number of CPUs/cores (4),
but the frontend reports only one CPU/core instead of the right number (8),
I suspect the xen kernel is hiding the actual number of cores
to play with virtualization.
This is a guess, admittedly, but founded on the evidence you sent us.
4) You can get another piece of evidence if you submit a 4-processor
PCGAMESS job to compute-0-0 via SGE, login to compute-0-0, do "top" there,
and type "1" (number one) within top.
I expect this will report all 4 CPUs, and there won't be any sharing
(not 25% for each), but around 99-100% CPU activity on each pcgamess
process.
This is in contrast to what you saw on the frontend.
Please, do the experiment, and send the result.
5) Why and how you got to this state of affairs,
with different kernels on the frontend and on the nodes,
is a mystery that only you, if anybody, can tell.
You may have installed the xen roll on the frontend
after you installed the compute nodes, for instance.
6) Do you need xen?
If your main goal is to run parallel jobs in computational chemistry
using MPI,
I would say your business is not virtualization, and you don't need xen.
Actually, you may really want to stay away from it,
if standard parallel computing is all you want.
(Search the list archives for different opinions about this.)
7) There may be a way to fix the frontend, but I don't know how to.
I don't use xen, never installed it.
You may want to start a new thread with this question, i.e.,
how to remove xen from your frontend, or at least how to make it report
and use the correct number of physical CPUs/cores you have even if xen
is kept.
8) I am not sure this is feasible,
but I am reluctant to tell you to reinstall the cluster from scratch,
without xen.
(Reinstalling would give you a homogeneous cluster, though.)
The Rocks developers and other list subscribers experienced with xen may
advise you better.
I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa, PhD - Email: g...@ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
---------------------------------------------------------------------
Maybe this is a bug or a feature in this rocks5-1 xen build.
One can understand that if you are using xen to build domU guests, then you
do not want dom0 to use all the CPUs, so it is assigned only 1 vcpu.
But if you do not want to use xen and VMs for any other domU, then it is
better to have dom0 present all the vcpus to you. :-)
IMHO, even if one chooses the xen roll, grub/menu.lst should provide
the option not to boot into dom0.
I changed the title a bit.
Happy holidays!
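(Along the lines of the grub/menu.lst point above: a hedged sketch of what a non-xen boot entry could look like. The kernel version matches the one reported in this thread, but the disk layout, root device, and initrd name are assumptions; check /boot and the existing menu.lst entries for the actual names before editing.)

```
title CentOS (2.6.18-92.1.13.el5, non-xen)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-92.1.13.el5 ro root=LABEL=/
        initrd /initrd-2.6.18-92.1.13.el5.img
```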
--- On Tue, 12/30/08, Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com> wrote:
> From: Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com>
[root@mr9540 ~]# xm vcpu-set 0 8
[root@mr9540 ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 0
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 1
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 2
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 3
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 4
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 5
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 6
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
stepping : 10
cpu MHz : 1995.025
cache size : 6144 KB
physical id : 7
siblings : 1
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu tsc msr pae mce cx8 apic mtrr mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni tm2 lahf_lm
bogomips : 4989.34
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
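(One caveat, as far as I know: "xm vcpu-set 0 8" does not persist across reboots. To have dom0 see all 8 vcpus at boot, the xen hypervisor line in /boot/grub/menu.lst can take a dom0_max_vcpus option; the sketch below assumes the xen.gz name matches the installed hypervisor, so check the existing menu.lst entry first.)

```
kernel /xen.gz-2.6.18-92.1.13.el5 dom0_max_vcpus=8
```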
--- On Tue, 12/30/08, Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com> wrote:
> From: Dr. Hung-Sheng Tsao (LaoTsao) <Hung-Sh...@sun.com>