Puppet client on Solaris 10


Silflay Hraka

Jun 25, 2008, 1:28:27 PM
to puppet...@googlegroups.com
Has anyone gotten the latest Puppet client to run on Solaris 10?  I installed it via Blastwave with no problems, and it starts up and connects to the puppetmaster server without issue.  However, after about 12 hours of running smoothly, all attempts to connect to the puppetmaster fail and leave this error in the syslog:

"Could not find server munged.net.unc.edu: getaddrinfo: node name or service name not known"

In addition, all user logins to the server the client is running on start failing at pretty much the same time, including that of the root user from a console connection.  The server has to be rebooted before logins succeed again.  The errors imply that Puppet is somehow interfering with pam.conf or our Kerberos connection, though none of those files are touched by Puppet:

Jun 24 09:16:10 munged.oit.unc.edu sshd[5261]: [ID 603599 auth.crit] pam_krb5afs: authenticate error: Cannot contact any KDC for requested realm (-1765328228)
Jun 24 09:16:10 eyepop.oit.unc.edu sshd[5261]: [ID 603599 auth.info] pam_krb5afs: authentication fails for `somepoorsob'

Has anyone experienced anything like the above?

I've tried to work around the login issue by calling puppet twice an hour via cron, but Ruby dumps a core file on me every time.
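
For reference, the cron workaround looks roughly like this (a sketch; the /opt/csw path assumes the Blastwave layout, and --onetime tells puppetd to do a single run and exit):

# run the client twice an hour instead of leaving the daemon running
0,30 * * * * /opt/csw/bin/puppetd --onetime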

None of these problems occur on my Linux and Solaris 9 boxes running Puppet, nor do those boxes experience any difficulties connecting during the periods when the Puppet client begins to fail on my Solaris 10 servers.

Thanks for any help,

Sid Stafford
its-networking
UNC-CH

Luke Kanies

Jun 25, 2008, 6:59:03 PM
to puppet...@googlegroups.com

I know that some people are using Puppet on Solaris 10, and I've not
seen this before. I don't specifically know whether any Solaris 10
users are also using Kerberos, although some Puppet users definitely
are.

This definitely seems strange, and those segfaults imply that the
Ruby build may not be that great. Are you using the Blastwave Ruby,
or the Solaris build?

--
The surest sign that intelligent life exists elsewhere in the universe
is that it has never tried to contact us.
--Calvin and Hobbes (Bill Watterson)
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Vipul Ramani

Jun 25, 2008, 8:26:35 PM
to Puppet Users
Hi Silflay Hraka,

I'm using the latest Puppet client and server on Solaris 10 x86 and it
is working perfectly. I'm not using Kerberos, but I am using Sun ONE
LDAP authentication via PAM in a netgroup environment.

"Could not find server munged.net.unc.edu: getaddrinfo: node name or
service
name not known"

Can you ping munged.net.unc.edu from your Puppet client? Or add an
entry for munged.net.unc.edu with its IP address to /etc/hosts and
then try again.
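
Something along these lines (IP_ADDRESS stands in for the puppetmaster's real address):

# check whether the client can still resolve and reach the puppetmaster
ping munged.net.unc.edu
getent hosts munged.net.unc.edu
# as a temporary workaround, pin the name locally
echo "IP_ADDRESS munged.net.unc.edu" >> /etc/hosts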

Regards
Vipul Ramani

Russ Allbery

Jun 25, 2008, 8:50:27 PM
to puppet...@googlegroups.com
Luke Kanies <lu...@madstop.com> writes:

>> Jun 24 09:16:10 munged.oit.unc.edu sshd[5261]: [ID 603599 auth.crit]
>> pam_krb5afs: authenticate error: Cannot contact any KDC for requested
>> realm (-1765328228)

This error message means that your krb5.conf file doesn't have any realm
information for your local realm, you don't have DNS records for your
local realm, or the DNS records are not resolving properly. In other
words, it's either a krb5.conf problem or a DNS name resolution problem
(possibly /etc/resolv.conf). I suspect one of those files has been
corrupted somehow. nsswitch.conf is another outside possibility.
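
As a sanity check, the realm information in question lives in a stanza like this (a generic sketch with hypothetical realm and KDC names, not your actual values; on Solaris 10 the file is /etc/krb5/krb5.conf):

[libdefaults]
        default_realm = EXAMPLE.EDU

[realms]
        EXAMPLE.EDU = {
                kdc = kdc1.example.edu
                admin_server = kdc1.example.edu
        }

If that stanza is intact when logins fail, suspect DNS resolution instead.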

--
Russ Allbery (r...@stanford.edu) <http://www.eyrie.org/~eagle/>

Silflay Hraka

Jun 27, 2008, 9:40:59 AM
to puppet...@googlegroups.com
Thanks Luke.

We're using the Blastwave Ruby--figuring that getting all the packages from one place would reduce the incompatibilities. I'll try building everything myself on the next go-round.

--Sid

Silflay Hraka

Jun 27, 2008, 9:43:25 AM
to puppet...@googlegroups.com
Thanks Russ.

I would agree that I've got corrupt files somewhere, but if we reboot the server after the logins start failing, then all is well until the next outage.  When I've compared those files to a non-Puppet Solaris 10 server, there are no differences.

--Sid

Russ Allbery

Jun 27, 2008, 4:35:56 PM
to puppet...@googlegroups.com
"Silflay Hraka" <sil...@gmail.com> writes:

> I would agree that I've got corrupt files somewhere, but if we reboot the
> server after the logins start failing, then all is well until the next
> outage. When I've compared those files to non-puppet Solaris 10 server,
> there are no differences.

Oh, hm, that would point to nscd. Are you running nscd on your system,
and if so, do you have it configured to cache DNS entries? We had no end
of trouble with nscd on our Solaris systems to the point that we forcibly
disabled it everywhere, but that experience dates from Solaris 8 and 9.
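
To check for it, and to disable it the way we did (standard Solaris 10 SMF commands; on Solaris 8 and 9 you stop the init script instead):

# is the cache daemon running, and is it caching hosts?
svcs svc:/system/name-service-cache
grep hosts /etc/nscd.conf
# take it out of the picture entirely
svcadm disable svc:/system/name-service-cache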

hvm

Jul 4, 2008, 11:39:51 AM
to Puppet Users
> Oh, hm, that would point to nscd. Are you running nscd on your system,
> and if so, do you have it configured to cache DNS entries? We had no end
> of trouble with nscd on our Solaris systems to the point that we forcibly
> disabled it everywhere, but that experience dates from Solaris 8 and 9.

We've just started using Puppet (current Blastwave version, Solaris
10u5 x86) and we're having similar issues on this particular system:
after a few days, or sometimes even hours, the system can't resolve
hosts or even users. Invalidating caches through nscd -i doesn't help,
but restarting svc:/system/name-service-cache does. We've never had any
nscd issues before, so I'm curious how its cache is affected here. Any
ideas? Thanks in advance for your insights.
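
For completeness, the two approaches look like this (a sketch; nscd -i takes the name of the cache to invalidate, hosts and passwd here being examples):

# flush individual caches -- this did not help
nscd -i hosts
nscd -i passwd
# restart the whole cache daemon via SMF -- this did
svcadm restart svc:/system/name-service-cache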

Hans van der Made
Utrecht University
NL

hvm

Jul 7, 2008, 10:25:32 AM
to Puppet Users


On Jul 4, 5:39 pm, hvm <chitchat...@gmail.com> wrote:
> > Oh, hm, that would point to nscd. Are you running nscd on your system,
> > and if so, do you have it configured to cache DNS entries? We had no end

My /var/adm/nscd.log shows messages like these:

res_init: socket: Too many open files
res_init: socket: Too many open files
res_init: socket: Too many open files
res_init: socket: Too many open files

This might explain nscd's seemingly random behaviour. Now let's see
if we can find the real culprit :)

hvm

Jul 8, 2008, 6:38:18 AM
to Puppet Users

> My /var/adm/nscd.log shows messages like these:
>
> res_init: socket: Too many open files

Don't know if anyone is still reading, but I've got some additional
info:

* every puppet run has only one change:

debug: /puppetconfig/main/File[/var/puppet/state/state.yaml]: 1 change(s)
debug: /puppetconfig/main/File[/var/puppet/state/state.yaml]/mode: mode changed '640' to '660'

* every puppet run, either through "kill -SIGUSR1 `pgrep puppetd`" or
run from the command line with --debug, results in a few extra
instances of /etc/group in the lsof output, as sketched below
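
A quick way to watch this (a sketch; assumes puppetd and nscd are both running):

# count nscd's descriptors on /etc/group, force a run, then count again
lsof | grep '^nscd' | grep -c /etc/group
kill -USR1 `pgrep puppetd`
sleep 60
lsof | grep '^nscd' | grep -c /etc/group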

I've disabled the puppetd service to see if this count decreases over
time, will keep you posted.

Marcin Owsiany

Jul 8, 2008, 11:08:11 AM
to puppet...@googlegroups.com

I _think_ someone mentioned to me once that there is a bug in nscd that
causes it to leak FDs.

Marcin
--
Marcin Owsiany <mar...@owsiany.pl> http://marcin.owsiany.pl/
GnuPG: 1024D/60F41216 FE67 DA2D 0ACA FC5E 3F75 D6F6 3A0D 8AA0 60F4 1216

"Every program in development at MIT expands until it can read mail."
-- Unknown

Silflay Hraka

Jul 8, 2008, 11:12:23 AM
to puppet...@googlegroups.com
I'm not ready to pronounce the issue solved quite yet, but since we've disabled svc:/system/name-service-cache on the target system, we've gone a record amount of time without the login problems recurring.  If the system behaves for another couple of days, then I'll consider it resolved.

--Sid

hvm

Jul 9, 2008, 9:45:34 AM
to Puppet Users
@Marcin: I've seen this bug described, but I believe it was a specific
LDAP issue.

@Silflay: I'd rather find and tackle the issue, instead of this
workaround. All our other systems run just fine with nscd running, and
the FD count only increases with puppetd runs:

nscd 15008 root cwd VDIR 102,0    1024    2 /
nscd 15008 root txt VREG 102,0  157980  77454 /usr/sbin/nscd
nscd 15008 root txt VREG 102,0   18672 216087 /lib/nss_user.so.1
nscd 15008 root txt VREG 102,0   33172 216083 /lib/nss_dns.so.1
nscd 15008 root txt VREG 102,0   58288 216084 /lib/nss_files.so.1
nscd 15008 root txt VREG 102,0   75560 216052 /lib/libmd.so.1
nscd 15008 root txt VREG 102,0   19616 216055 /lib/libmp.so.2
nscd 15008 root txt VREG 102,0   61740 216085 /lib/nss_nis.so.1
nscd 15008 root txt VREG 102,0   38672 216041 /lib/libgen.so.1
nscd 15008 root txt VREG 102,0   38920 216079 /lib/libuutil.so.1
nscd 15008 root txt VREG 102,0 1079912  46622 /usr/lib/libc/libc_hwcap1.so.1
nscd 15008 root txt VREG 102,0    9544 216022 /lib/libavl.so.1
nscd 15008 root txt VREG 102,0  121536 216066 /lib/libscf.so.1
nscd 15008 root txt VREG 102,0  129452 216077 /lib/libumem.so.1
nscd 15008 root txt VREG 102,0   83352 216072 /lib/libsocket.so.1
nscd 15008 root txt VREG 102,0   15924 216038 /lib/libdoor.so.1
nscd 15008 root txt VREG 102,0  728064 216056 /lib/libnsl.so.1
nscd 15008 root txt VREG 102,0  287512 216062 /lib/libresolv.so.2
nscd 15008 root txt VREG 102,0  213704 216019 /lib/ld.so.1
nscd 15008 root 0u  VCHR 13,2  6815752 /devices/pseudo/mm@0:null
nscd 15008 root 1u  VCHR 13,2  6815752 /devices/pseudo/mm@0:null
nscd 15008 root 2w  VREG 102,0 1064241  22582 / (/dev/dsk/c1d0s0)
nscd 15008 root 3u  DOOR 0,3557    0t0   3557 (this PID's door)
nscd 15008 root 4u  sock           0t0        AF_ROUTE, SOCK_RAW
nscd 15008 root 5r  VREG 102,0     327 212090 /etc/group
nscd 15008 root 6r  VREG 102,0     327 212090 /etc/group
nscd 15008 root 7r  VREG 102,0     327 212090 /etc/group

[lots and lots of references to /etc/group removed]

root@ultra20:~# lsof | grep "^nscd" | grep /etc/group | wc -l
140

Regards,

Hans
NL

Matt McLeod

Jul 13, 2008, 10:42:06 PM
to puppet...@googlegroups.com
I’ve had problems with nscd for as long as I can remember, but the standard build at my current workplace included leaving it on, so I got to see first-hand just how badly it interacts with Puppet.

Prior to this it was also giving us trouble on a ClearCase server.  A support case lodged with Sun was resolved with “disable nscd, then” rather than any indication that they accept there’s some problem with it.

We’re now disabling nscd via puppet on all hosts, which has cleared up pretty much all name resolution problems we’d been having.  Fortunately we’re not using LDAP for user account management so disabling nscd isn’t a big problem.
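
In manifest form this is just a service resource; a minimal sketch using the stock service type (name-service-cache is the SMF instance on Solaris 10):

# stop nscd and keep it from starting at boot
service { "name-service-cache":
    ensure => stopped,
    enable => false,
}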

Matt

Russ Allbery

Jul 14, 2008, 1:14:21 AM
to puppet...@googlegroups.com
Matt McLeod <matt....@itg.com> writes:

> We're now disabling nscd via puppet on all hosts, which has cleared up
> pretty much all name resolution problems we'd been having. Fortunately
> we're not using LDAP for user account management so disabling nscd isn't
> a big problem.

I believe you can configure nscd to only cache users and not cache hosts
if you need to run it for LDAP. It's the DNS cache that usually broke
things for us.

hvm

Jul 15, 2008, 8:09:19 PM
to Puppet Users
Hi Matt,

> We're now disabling nscd via puppet on all hosts, which has cleared up
> pretty much all name resolution problems we'd been having. Fortunately

Same thing here. I have checked both Puppet and nscd documentation and
could not find anything relevant. Solaris appears to be a well-supported
platform, so I'm a bit surprised the issue isn't documented.
Nevertheless, I'd like to express my gratitude to the Puppet developers
for their hard work. Tools like these are hard to find.

martin

Jul 18, 2008, 6:21:13 PM
to Puppet Users
Russ,

On Jul 14, 2:14 am, Russ Allbery <r...@stanford.edu> wrote:

> I believe you can configure nscd to only cache users and not cache hosts
> if you need to run it for LDAP. It's the DNS cache that usually broke
> things for us.
>
you're right. To tell nscd not to cache host information, edit
/etc/nscd.conf and change the default values:

positive-time-to-live   hosts   3600
negative-time-to-live   hosts   5
keep-hot-count          hosts   20
check-files             hosts   yes

to something more appropriate, or just remove the hosts entries.
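
There is also a per-cache switch, so a sketch like this should keep the passwd and group caches while dropping hosts entirely (check nscd.conf(4) on your release to confirm):

# /etc/nscd.conf -- stop caching hosts, leave the other caches alone
enable-cache    hosts   no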

cheers,
/Martin