12/13/2007 15:18:42|qmaster|mgmtHost|W|job 854800.1 failed on host
compute-1-1.local general assumedly before job because: can't get
password entry for user "username". Either the user does not exist
or NIS error!
I found a message on this mailing list dated 9/27/2005 which suggests
fixing this problem by retrieving the latest files from the frontend
using the 411get command:
# cluster-fork --node="compute-1-1.local" 411get --all
We are confused by this error and not sure whether it is related to
the fact that we are increasing the number of compute nodes on the cluster.
I say that because the problem has occurred more frequently since two
weeks ago, when we added the latest 25% of the nodes to the cluster
(we currently have 185 dual-processor V20z).
As a temporary solution (we are still monitoring whether it works), we
created a cron job that runs every 10 minutes to retrieve the latest files
from the frontend, and decreased the polling interval to 5 minutes, but we
are looking for a final solution to this problem. Has this issue already
been solved? Is it an SGE bug? A Rocks bug? Any help would
be greatly appreciated.
Best,
- Hugo
--
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.he...@loni.ucla.edu
--
"If your efforts were met with indifference, do not be discouraged,
for the sun puts on a marvelous show every morning
while most people are still asleep"
On Rocks 4.3 you can just run
# rocks sync users
after adding (or modifying) any user account. You must do this after
adding any new user accounts.
--
-mjk
As a note of clarification, we have an application we use to submit
jobs to our cluster -- it submits the job as an account defined for
the app, then before executing performs an su to the real user's
account. The issue we are experiencing here (as reported by the GE
error) isn't with new accounts, but rather with a single account that
is known at a certain point to exist in the passwd file but (randomly)
seems to disappear. During the 411get, is there ever a period where
the passwd file is either empty or resets to a default?
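One way to gather evidence for the question above is to poll a node's
passwd file and log every moment at which the account is missing or the
file is empty. The sketch below is illustrative only: the function name
check_passwd_user, the account name, the log path and the 5-second
interval are hypothetical choices, not part of Rocks or 411.

```shell
#!/bin/sh
# Hypothetical helper: report the state of USER in a passwd-format FILE.
# Prints "ok", "missing" (file has content but no entry for USER) or
# "empty" (file is absent or has zero length).
check_passwd_user() {
    file=$1
    user=$2
    if [ ! -s "$file" ]; then
        echo "empty"
    elif awk -F: -v u="$user" '$1 == u { found = 1 } END { exit !found }' "$file"; then
        echo "ok"
    else
        echo "missing"
    fi
}

# Example polling loop (uncomment to run on a compute node; the account
# name, log path and interval are assumptions):
# while true; do
#     state=$(check_passwd_user /etc/passwd username)
#     [ "$state" = "ok" ] || echo "$(date) passwd state: $state" >> /tmp/passwd-watch.log
#     sleep 5
# done
```

A poll like this can miss very short windows, so a clean log does not
prove the file was never empty or truncated.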
Thank you very much,
Jonathan
Jonathan Pierce
Systems Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South,
Suite 225 Los Angeles, CA 90095-7334
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonatha...@loni.ucla.edu
-P
On Dec 14, 2007 4:12 PM, Jonathan Pierce <jonatha...@loni.ucla.edu>
wrote:
--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628
________________________________________
From: Philip Papadopoulos [philip.pa...@gmail.com]
Sent: Friday, December 14, 2007 5:31 PM
To: Jonathan Pierce
Cc: m...@sdsc.edu; Neurology LONI SysAdm; teje...@chem.northwestern.edu; Hung-Sh...@sun.com; npaci-rocks...@sdsc.edu
Subject: Re: [Rocks-Discuss] SGE can't get password entry for users
What version of Rocks are you running?
Prior to 4.3, there is a race in the 411get code that could produce
the type of behaviour that you are seeing; see
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2007-April/024839.html
-P
On Dec 14, 2007, at 10:39 AM, Mason J. Katz wrote:
> On Rocks 4.3 you can just run.
>
> # rocks sync users
>
> After adding (or modifying) any user account. You must do this after
> adding any new user accounts.
Does dbreport have the facilities to output the passwd file? I'm
unable to find that option in 4.3. Regardless, though, this should be
a non-issue. Our script operates as such:
ypcat passwd > /etc/passwd.nis
cat /etc/passwd.local /etc/passwd.nis > /etc/passwd.combined
mv /etc/passwd.combined /etc/passwd
Similarly for the other necessary NIS maps. The issue here, however,
doesn't seem to lie with the frontend, as we've never experienced nor
heard complaints about an inability to log in to that machine. I would
guess there's something happening on the back end, with the nodes, but I'm
still trying to find evidence to substantiate this.
And in answer to your other question, no, the machine's default
behavior is not to rebuild after reboot. Thank you for taking the
time to help troubleshoot this.
Sincerely,
Jonathan
Jonathan Pierce
Systems Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South,
Suite 225 Los Angeles, CA 90095-7334
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonatha...@loni.ucla.edu
On Dec 15, 2007, at 1:23 PM, LaoTsao(Dr. Tsao) wrote:
> hi
> IMHO, you did not follow the Rocks rules.
> Your /etc/passwd file was not stored in the Rocks database;
> any time dbreport resets /etc/passwd it just goes back to the default,
> e.g. you did not use the modified useradd to add the user to the Rocks
> system
> my 2c
> --
> Hung-Sheng Tsao, Ph.D. (LaoTsao) Sr. System Engineer
> US, GEH East Data Center Ambassador
> 400 Atrium Dr, 1ST FLOOR P/F:1877 319 0460 (x67079)
> Somerset, NJ 08873 C: 973 495 0840
> http://blogs.sun.com/hstsao/ E:Hung-Sh...@sun.com
Shawn
#!/bin/sh
# Changelog
# 10/9/07 - Added support for a direct ypcat that does not require the frontend
# to be bound to NIS - Shawn
# Before using this script, copy /etc/passwd to /etc/passwd.local, /etc/group to
# /etc/group.local and /etc/hosts to /etc/hosts.local. Be careful when using this
# configuration, especially when adding users or groups, especially using the Rocks
# modified useradd.
# These need to be uncommented and modified for your site's NIS server
# Use the IP address of the NIS master for this so we're sure to get
# authoritative info
#export dom=nis.domain.name
#export ypserv=111.222.333.444
# Check to make sure we can talk to the NIS server
if ! ypcat -d $dom -h $ypserv auto.master > /dev/null 2>&1; then
    echo "Can't talk to NIS server $ypserv"
    exit 1
fi
# Let's gather up all of the automount maps
ypcat -d $dom -h $ypserv -k auto.master 2> /dev/null | grep auto. | egrep -iv 'direct|home' |sed -e '/^#/d' -e '/^$/d' | while read dir map options
do
ypcat -d $dom -h $ypserv -k "$map" > /etc/$map
/opt/rocks/sbin/411put /etc/$map > /dev/null 2>&1
done
# Here we grab the good stuff out of auto.master
ypcat -d $dom -h $ypserv -k auto.master 2> /dev/null | grep auto. | grep -v 'direct' |sed -e '/^#/d' -e '/^$/d' | while read dir map options
do
echo "$dir /etc/$map $options" >> /etc/auto.master.tmp
done
# You can add in another map, like we NFS export /state/partition1 on all the nodes
# You'll need to create an automount map on the frontend to facilitate it
#echo "/clustmp /etc/auto.clustmp -rw,nosuid,intr" >> /etc/auto.master.tmp
#/opt/rocks/sbin/411put /etc/auto.clustmp > /dev/null 2>&1
cp /etc/auto.master.tmp /etc/auto.master
rm /etc/auto.master.tmp
/opt/rocks/sbin/411put /etc/auto.master > /dev/null 2>&1
# Make auto.home, and hack in /home/install
ypcat -d $dom -h $ypserv -k auto.home > /etc/auto.home
echo "install 10.1.1.1:/export/home/install" >> /etc/auto.home
/opt/rocks/sbin/411put /etc/auto.home > /dev/null 2>&1
# Merge the local and NIS passwd entries, then publish the result via 411
ypcat -d $dom -h $ypserv passwd > /etc/passwd.nis
cat /etc/passwd.local /etc/passwd.nis > /etc/passwd.combined
cp /etc/passwd.combined /etc/passwd
/opt/rocks/sbin/411put /etc/passwd > /dev/null 2>&1
# Same for the group map
ypcat -d $dom -h $ypserv group > /etc/group.nis
cat /etc/group.local /etc/group.nis > /etc/group.combined
cp /etc/group.combined /etc/group
/opt/rocks/sbin/411put /etc/group > /dev/null 2>&1
# Hosts: keep a backup, then combine the Rocks-generated hosts with the NIS hosts
cp /etc/hosts /etc/hosts.old
ypcat -d $dom -h $ypserv hosts > /etc/hosts.nis
/opt/rocks/bin/dbreport hosts > /etc/hosts.dbreport
cat /etc/hosts.dbreport /etc/hosts.nis > /etc/hosts
/opt/rocks/sbin/411put /etc/hosts > /dev/null 2>&1
We manage a cluster that used to use an external NIS
server (we use an external LDAP server now). I'll see
if I can find the old NIS/411 stuff.
Ian Kaufman
Research Systems Administrator
Jacobs School of Engineering
ikau...@soe.ucsd.edu x49716
Sorry that I haven't followed up on this issue. Just to put a close
to it, there was a problem in retrieving the passwd map through 'ypcat
passwd'. The stock passwd file was then combined with an empty
passwd.nis file, which was then propagated through 411, so the issue
was on our end. Thank you very much to everybody who offered their
assistance in fixing this.
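The failure described above (an empty passwd.nis being merged and then
pushed out through 411) can be guarded against by validating the NIS dump
before it replaces /etc/passwd. The following is a minimal sketch, not the
script used at either site: the function name merge_passwd and the
minimum-entry threshold are assumptions, and the paths in the usage comment
follow the ones quoted earlier in the thread.

```shell
#!/bin/sh
# Hypothetical helper: merge LOCAL and NIS passwd files into OUT, but only
# if the NIS dump holds at least MIN entries (default 1). OUT is replaced
# with mv from a temp file in the same directory, so readers never see a
# half-written file.
merge_passwd() {
    local_file=$1
    nis_file=$2
    out_file=$3
    min_entries=${4:-1}

    # Count lines that look like passwd entries (seven colon-separated fields).
    count=$(awk -F: 'NF == 7 { n++ } END { print n + 0 }' "$nis_file" 2>/dev/null)
    if [ "${count:-0}" -lt "$min_entries" ]; then
        echo "NIS dump $nis_file has ${count:-0} entries, refusing to merge" >&2
        return 1
    fi

    tmp_file="$out_file.tmp.$$"
    cat "$local_file" "$nis_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    mv "$tmp_file" "$out_file"
}

# Possible usage on the frontend (only push through 411 when the merge succeeded):
# ypcat passwd > /etc/passwd.nis
# merge_passwd /etc/passwd.local /etc/passwd.nis /etc/passwd 1 \
#     && /opt/rocks/sbin/411put /etc/passwd
```

With a check like this, a failed ypcat leaves the previous /etc/passwd in
place instead of propagating a file that contains only the local accounts.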
Sincerely,
Jonathan
Jonathan Pierce
Systems Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South,
Suite 225 Los Angeles, CA 90095-7334
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonatha...@loni.ucla.edu