[Rocks-Discuss] SGE can't get password entry for users


Hugo R. Hernandez-Mora

Dec 13, 2007, 9:59:34 PM
to npaci-rocks-dis...@sdsc.edu, teje...@chem.northwestern.edu, Neurology LONI SysAdm, Hung-Sh...@sun.com, npaci-rocks...@sdsc.edu
Hi All,
We are having an issue in which SGE reports that it can't get the
password entry for users. It seems to occur randomly on different
compute nodes of our Rocks cluster. We are using Rocks v4.3 + SGE
6.1u2, and intermittently we receive the following error in the
qmaster messages file:

12/13/2007 15:18:42|qmaster|mgmtHost|W|job 854800.1 failed on host
compute-1-1.local general assumedly before job because: can't get
password entry for user "username". Either the user does not exist
or NIS error!

I found a message on this mailing list dated 9/27/2005 which suggests
fixing this problem by retrieving the latest files from the frontend
with the 411get command:

# cluster-fork --node="compute-1-1.local" 411get --all
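A quick way to confirm the symptom on a suspect node is to check whether the account is present in that node's passwd file. The following is a minimal sketch, not part of Rocks or SGE: `has_passwd_entry` is a hypothetical helper name, and it assumes the node resolves users from a flat passwd-format file (as distributed by 411) rather than from a live NIS or LDAP lookup.

```shell
#!/bin/sh
# has_passwd_entry USER [FILE]
# Hypothetical helper: succeeds (exit 0) if USER has an entry in FILE.
# FILE defaults to /etc/passwd and must be in passwd(5) format.
has_passwd_entry() {
    user=$1
    file=${2:-/etc/passwd}
    # Compare the login name exactly against the first colon-separated field,
    # so "ali" does not match "alice".
    awk -F: -v u="$user" '$1 == u { found = 1 } END { exit !found }' "$file"
}

# Example (commented out): report a missing account on this host.
# has_passwd_entry username || echo "username missing on $(hostname)"
```

This reads the file directly; `getent passwd username` is the alternative when the lookup should go through the node's configured name services.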

We are confused by this error and are not sure whether it is related to
the fact that we are increasing the number of compute nodes on the
cluster. I say that because the problem has become more frequent since
two weeks ago, when we added the latest 25% of the nodes to the cluster
(we currently have 185 dual-processor V20z nodes).

As a temporary workaround (we are still monitoring whether it works), we
created a cron job that runs every 10 minutes to retrieve the latest
files from the frontend, and we decreased the polling interval to 5
minutes, but we are looking for a permanent fix. Is it possible that
this issue has already been solved? Is it an SGE bug? A Rocks bug? Any
help would be greatly appreciated.

Best,
- Hugo


--
Hugo R. Hernandez-Mora
System Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
hugo.he...@loni.ucla.edu
--

"If your efforts were met with indifference, do not be discouraged,
for the sun puts on a marvelous show every morning
while most people are still asleep"


Mason J. Katz

Dec 14, 2007, 1:39:36 PM
to Hugo R. Hernandez-Mora, Neurology LONI SysAdm, npaci-rocks-dis...@sdsc.edu, teje...@chem.northwestern.edu, Hung-Sh...@sun.com, npaci-rocks...@sdsc.edu
On Rocks 4.3 you can just run:

# rocks sync users

Run this after adding (or modifying) any user account; it is required
after adding any new user accounts.


--
-mjk

Jonathan Pierce

Dec 14, 2007, 7:12:10 PM
to m...@sdsc.edu, Neurology LONI SysAdm, teje...@chem.northwestern.edu, Hung-Sh...@sun.com, npaci-rocks...@sdsc.edu
Hi Mason,

As a note of clarification, we have an application we use to submit
jobs to our cluster -- it submits the job as an account defined for
the app, then before executing performs an su to the real user's
account. The issue we are experiencing here (as reported by the GE
error) isn't with new accounts, but rather with a single account that
is known at a certain point to exist in the passwd file but (randomly)
seems to disappear. During the 411get, is there ever a period where
the passwd file is either empty or resets to a default?

Thank you very much,
Jonathan

Jonathan Pierce
Systems Administrator
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South,
Suite 225 Los Angeles, CA 90095-7334
Tel: 310.267.5076
Cell: 310.487.8365
Fax: 310.206.5518
jonatha...@loni.ucla.edu

Philip Papadopoulos

Dec 14, 2007, 8:31:35 PM
to Jonathan Pierce, teje...@chem.northwestern.edu, npaci-rocks...@sdsc.edu, Neurology LONI SysAdm, m...@sdsc.edu, Hung-Sh...@sun.com
What version of Rocks are you running?
Prior to 4.3 there was a race in the 411get code (see
https://lists.sdsc.edu/pipermail/npaci-rocks-discussion/2007-April/024839.html)
that could produce the type of behaviour you are seeing.

-P


--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628



Mason J. Katz

Dec 15, 2007, 2:15:06 AM
to Jonathan Pierce, Neurology LONI SysAdm, teje...@chem.northwestern.edu, Hung-Sh...@sun.com, npaci-rocks...@sdsc.edu
If the passwd file is ever missing this is a serious bug; we have not
seen this before. Are you using an external NIS or LDAP server here,
or is this really stock Rocks?

--
-mjk

Hugo Hernandez-Mora

Dec 15, 2007, 10:08:32 AM
to Philip Papadopoulos, Jonathan Pierce, teje...@chem.northwestern.edu, npaci-rocks...@sdsc.edu, Neurology LONI SysAdm, m...@sdsc.edu, Hung-Sh...@sun.com
Philip,
We are running Rocks v4.3 + SGE v6.1u2.
- Hugo



Rico Magsipoc

Dec 15, 2007, 4:04:46 PM
to m...@sdsc.edu, Jonathan Pierce, Neurology LONI SysAdm, teje...@chem.northwestern.edu, Hung-Sh...@sun.com, npaci-rocks...@sdsc.edu
The headnode is a NIS client that uses ypcat, run from cron, to write the NIS user information into its /etc/passwd file, and then distributes this information to the compute nodes via 411. The compute nodes are stock Rocks.

On the problematic compute nodes, the /etc/passwd file does not actually disappear; it resets to the default, without any NIS information, hence the authentication failures for NIS users. We assumed that the headnode's /etc/passwd file might be getting clobbered, but the cron job simply appends to the headnode's passwd file. So the question is: where and how do the problematic compute nodes get their /etc/passwd file reset to the default? Any light you can shed on this issue is appreciated.



.................................
Rico Magsipoc, MBA
Chief Technology Officer
Laboratory of Neuro Imaging, UCLA
635 Charles E. Young Drive South, Suite 225
Los Angeles, CA 90095-7332
Tel: 310.267.5076
Fax: 310.206.5518
Cell: 310.467.8730
rico.m...@loni.ucla.edu
................................


Jonathan Pierce

Dec 17, 2007, 9:38:16 PM
to LaoTsao, Rico Magsipoc, Hu...@sdsc.edu, m...@sdsc.edu, npaci-rocks...@sdsc.edu, Philip Papadopoulos
Hi Dr. Tsao,

Does dbreport have the facilities to output the passwd file? I'm
unable to find that option in 4.3. Regardless, though, this should be
a non-issue. Our script operates as such:

ypcat passwd > /etc/passwd.nis
cat /etc/passwd.local /etc/passwd.nis > /etc/passwd.combined
mv /etc/passwd.combined /etc/passwd

We do the same for the other necessary NIS maps. The issue here,
however, doesn't seem to lie with the frontend, as we've never
experienced nor heard complaints about an inability to log in to that
machine. I would guess something is happening on the back end with the
nodes, but I'm still trying to find evidence to substantiate this.

And in answer to your other question, no, the machine's default
behavior is not to rebuild after reboot. Thank you for taking the
time to help troubleshoot this.

Sincerely,
Jonathan


On Dec 15, 2007, at 1:23 PM, LaoTsao(Dr. Tsao) wrote:

> Hi,
> IMHO, you did not follow the Rocks rules.
> Your /etc/passwd file is not stored in the Rocks database, so
> any time dbreport resets /etc/passwd it just goes back to the default,
> e.g. you did not use the modified useradd to add the user to the
> Rocks system.
> My 2c.

> --
> Hung-Sheng Tsao, Ph.D. (LaoTsao) Sr. System Engineer
> US, GEH East Data Center Ambassador
> 400 Atrium Dr, 1ST FLOOR P/F:1877 319 0460 (x67079)
> Somerset, NJ 08873 C: 973 495 0840
> http://blogs.sun.com/hstsao/ E:Hung-Sh...@sun.com

Stephens, Shawn H (SAIC)

Dec 17, 2007, 10:51:34 PM
to Jonathan Pierce, LaoTsao, npaci-rocks...@sdsc.edu
Below is our NIS to 411 script that we use in Rocks 3.3. I'm pretty sure most of it will still work in recent versions of Rocks. We call it yp-to-411, and run it in /etc/cron.hourly on the frontend node.

Shawn

#!/bin/sh
# Changelog
# 10/9/07 - Added support for a direct ypcat that does not require the frontend
# to be bound to NIS - Shawn

# Before using this script, copy /etc/passwd to /etc/passwd.local, /etc/group to
# /etc/group.local and /etc/hosts to /etc/hosts.local. Be careful when using this
# configuration, especially when adding users or groups, especially using the Rocks
# modified useradd.

# These need to be uncommented and modified for your site's NIS server
# Use the IP address of the NIS master for this so we're sure to get
# authoritative info
#export dom=nis.domain.name
#export ypserv=111.222.333.444

# Check to make sure we can talk to the NIS server
if ! ypcat -d $dom -h $ypserv auto.master > /dev/null 2>&1; then
    echo "Can't talk to NIS server $ypserv"
    exit 1
fi

# Let's gather up all of the automount maps
ypcat -d $dom -h $ypserv -k auto.master 2> /dev/null | grep auto. | egrep -iv 'direct|home' | sed -e '/^#/d' -e '/^$/d' | while read dir map options
do
    ypcat -d $dom -h $ypserv -k "$map" > /etc/$map
    /opt/rocks/sbin/411put /etc/$map > /dev/null 2>&1
done

# Here we grab the good stuff out of auto.master
ypcat -d $dom -h $ypserv -k auto.master 2> /dev/null | grep auto. | grep -v 'direct' | sed -e '/^#/d' -e '/^$/d' | while read dir map options
do
    echo "$dir /etc/$map $options" >> /etc/auto.master.tmp
done
# You can add in another map, like we NFS export /state/partition1 on all the nodes
# You'll need to create an automount map on the frontend to facilitate it
#echo "/clustmp /etc/auto.clustmp -rw,nosuid,intr" >> /etc/auto.master.tmp
#/opt/rocks/sbin/411put /etc/auto.clustmp > /dev/null 2>&1
cp /etc/auto.master.tmp /etc/auto.master
rm /etc/auto.master.tmp
/opt/rocks/sbin/411put /etc/auto.master > /dev/null 2>&1

# Make auto.home, and hack in /home/install
ypcat -d $dom -h $ypserv -k auto.home > /etc/auto.home
echo "install 10.1.1.1:/export/home/install" >> /etc/auto.home
/opt/rocks/sbin/411put /etc/auto.home > /dev/null 2>&1

ypcat -d $dom -h $ypserv passwd > /etc/passwd.nis
cat /etc/passwd.local /etc/passwd.nis > /etc/passwd.combined
cp /etc/passwd.combined /etc/passwd
/opt/rocks/sbin/411put /etc/passwd > /dev/null 2>&1

ypcat -d $dom -h $ypserv group > /etc/group.nis
cat /etc/group.local /etc/group.nis > /etc/group.combined
cp /etc/group.combined /etc/group
/opt/rocks/sbin/411put /etc/group > /dev/null 2>&1

cp /etc/hosts /etc/hosts.old
ypcat -d $dom -h $ypserv hosts > /etc/hosts.nis
/opt/rocks/bin/dbreport hosts > /etc/hosts.dbreport
cat /etc/hosts.dbreport /etc/hosts.nis > /etc/hosts
/opt/rocks/sbin/411put /etc/hosts > /dev/null 2>&1
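One detail worth noting about scripts of this shape: `cp` onto a live file truncates the destination before writing, so a reader can briefly see an empty or partial file, whereas renaming a finished temporary file over the target within the same filesystem is atomic. A minimal sketch of that pattern follows. `install_atomically` is a hypothetical helper, not part of Rocks or 411, and the 0644 mode is an assumption suitable for files such as passwd and group.

```shell
#!/bin/sh
# install_atomically SRC DEST
# Hypothetical helper: write SRC to a temporary file in DEST's directory,
# then rename it over DEST so readers never see a half-written file.
install_atomically() {
    src=$1
    dest=$2
    # The temporary file must live on the same filesystem as DEST,
    # otherwise mv falls back to copy-and-delete and is no longer atomic.
    tmp=$(mktemp "$dest.XXXXXX") || return 1
    if cat "$src" > "$tmp" && chmod 644 "$tmp" && mv -f "$tmp" "$dest"; then
        return 0
    fi
    # Something failed: leave DEST untouched and clean up.
    rm -f "$tmp"
    return 1
}

# Example (commented out): replace the live file only once the copy is complete.
# install_atomically /etc/passwd.combined /etc/passwd
```

The rename replaces the file's inode, so ownership and any security context should be checked on the result if the site depends on them.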

Kaufman, Ian

Dec 18, 2007, 11:59:55 AM
to npaci-rocks...@sdsc.edu
Do you guys see any cron errors? You might want to
use the 411 Make system instead - placing the needed
files in /var/411/411.mk and running "make -C /var/411
force" as opposed to using /opt/rocks/sbin/411put.
It will do the same thing, but the make system seems
to do a bit more checking and tidying up.

We manage a cluster that used to use an external NIS
server (we use an external LDAP server now). I'll see
if I can find the old NIS/411 stuff.

Ian Kaufman
Research Systems Administrator
Jacobs School of Engineering
ikau...@soe.ucsd.edu x49716

Jonathan Pierce

Jan 7, 2008, 4:57:01 PM
to m...@sdsc.edu, teje...@chem.northwestern.edu, LaoTsao(Dr. Tsao), npaci-rocks...@sdsc.edu
Hi All,

Sorry that I haven't followed up on this issue. Just to put a close
to it, there was a problem in retrieving the passwd map through 'ypcat
passwd'. The stock passwd file was then combined with an empty
passwd.nis file, which was then propagated through 411, so the issue
was on our end. Thank you very much to everybody who offered their
assistance in fixing this.
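A guard against this failure mode can be sketched as follows. This is an illustration under stated assumptions, not the script we run: `merge_maps` and the minimum-entry threshold are hypothetical, and the idea is simply to refuse to build the combined file when the NIS dump is empty or suspiciously short.

```shell
#!/bin/sh
# merge_maps LOCAL NIS OUT [MIN_ENTRIES]
# Hypothetical helper: concatenate LOCAL and NIS into OUT, but only if the
# NIS dump holds at least MIN_ENTRIES non-empty lines (default 1).
merge_maps() {
    local_file=$1
    nis_file=$2
    out_file=$3
    min_entries=${4:-1}
    # Count non-empty lines; a missing or empty dump counts as 0.
    count=$(grep -c . "$nis_file" 2>/dev/null)
    count=${count:-0}
    if [ "$count" -lt "$min_entries" ]; then
        echo "refusing to merge: $nis_file has $count entries, expected at least $min_entries" >&2
        return 1
    fi
    cat "$local_file" "$nis_file" > "$out_file"
}

# Example (commented out); the threshold of 100 is an assumed site-specific value.
# ypcat passwd > /etc/passwd.nis &&
#     merge_maps /etc/passwd.local /etc/passwd.nis /etc/passwd.combined 100 &&
#     mv /etc/passwd.combined /etc/passwd
```

With a check like this, a failed `ypcat` leaves the previous /etc/passwd in place instead of pushing a stock file out to the nodes.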

Sincerely,
Jonathan

