nscd, door_call, multi-threading

Marc

May 2, 2005, 5:26:46 PM

I recently ran across the problem mentioned in:
http://www.science.uva.nl/pub/solaris/solaris2/Q5.72.html
Let me copy it here:
"5.72) Processes hang in door_call(), hostname lookups hang.

The door_call() on /etc/.name_service_door, is a fast IPC mechanism used
to call the name service cache daemon.

Usually, nscd speeds things up. However, on systems that do a lot of DNS
lookups, all such lookups are single threaded through nscd. Nscd itself
is multi-threaded, but the resolver library uses one big global lock. On
such systems, performance is often best served by disabling the nscd host
cache by editing /etc/nscd.conf like this:

enable-cache hosts no
"
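
To check which caches a given nscd.conf enables, a minimal Python sketch could look like the following. It only understands the "enable-cache <name> <yes|no>" form quoted above; the sample text and the function name cache_settings are illustrative assumptions, not something taken from the FAQ.

```python
def cache_settings(conf_text):
    """Return {cache_name: bool} for every enable-cache line in conf_text."""
    settings = {}
    for raw in conf_text.splitlines():
        # Drop trailing comments, then split the directive on whitespace.
        fields = raw.split("#", 1)[0].split()
        if len(fields) == 3 and fields[0] == "enable-cache":
            settings[fields[1]] = (fields[2].lower() == "yes")
    return settings


# Hypothetical sample text, not copied from a real system.
SAMPLE = """\
# illustrative nscd.conf fragment
enable-cache passwd yes
enable-cache hosts no
"""

if __name__ == "__main__":
    print(cache_settings(SAMPLE))
```

On a real machine the text would come from /etc/nscd.conf; a result with "hosts" mapped to False means the FAQ's workaround is in place.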

Has a truly multi-threaded version of the resolver library been
written since this FAQ entry? A cache mostly helps when the same
question is asked often. If this cache hurts precisely when it is used a
lot, that makes it close to useless...

PS: I tried to look it up on Sun's websites, but I kept getting into
those "get a support contract" messages.
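
The serialization the FAQ describes can be pictured with a toy model. This is only a sketch, assuming a lookup is a fixed sleep: it measures how many simulated lookups are in flight at once, with and without a single shared lock. It says nothing about how nscd or libresolv are actually implemented, and peak_concurrency is a made-up name.

```python
import threading
import time


def peak_concurrency(workers, lock=None, delay=0.05):
    """Run `workers` fake lookups; return the peak number in flight at once."""
    in_flight = 0
    peak = 0
    counter_lock = threading.Lock()      # protects the two counters only
    start = threading.Barrier(workers)   # release all threads together

    def fake_lookup():
        nonlocal in_flight, peak
        start.wait()
        if lock is not None:
            lock.acquire()               # the "one big global lock"
        try:
            with counter_lock:
                in_flight += 1
                peak = max(peak, in_flight)
            time.sleep(delay)            # stand-in for a slow name lookup
            with counter_lock:
                in_flight -= 1
        finally:
            if lock is not None:
                lock.release()

    threads = [threading.Thread(target=fake_lookup) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

Calling peak_concurrency(4, threading.Lock()) never sees more than one lookup in flight, however many threads are waiting, while peak_concurrency(4) lets them overlap. That is the sense in which a multi-threaded daemon can still behave as if it were single-threaded.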

ba...@smaalders.net

May 2, 2005, 11:40:25 PM

Libresolv has been MT capable for several releases (since Solaris 8 if
I remember correctly).

- Bart

Marc

May 3, 2005, 5:58:15 AM

ba...@smaalders.net wrote:

> Libresolv has been MT capable for several releases (since Solaris 8 if
> I remember correctly).

Weird, I ran across this issue on Solaris 9, and the workaround suggested
in the FAQ seems to have solved it. What am I missing here?

ba...@smaalders.net

May 4, 2005, 12:06:40 AM

Not sure. This can be a very subtle problem, and may also be caused
by badly behaved proxy DNS servers or round-robin problems (nscd
can defeat round-robin DNS load spreading).

If you can reproduce the problem, enabling logging on the nscd
at level 10 will give you a lot of info about the source of
the problem.

If these are external web sites causing the problem, send
an email with the web site details to me at work:

bart.sm...@sun.com

and I'll try to reproduce it. A few minutes w/ dtrace on Solaris 10
will show exactly what the problem is...

- Bart

Marc

May 4, 2005, 8:06:42 AM

ba...@smaalders.net wrote:

> Not sure. This can be a very subtle problem, and may also be caused
> by badly behaved proxy DNS servers or round-robin problems (nscd
> can defeat round-robin DNS load spreading).
>
> If you can reproduce the problem, enabling logging on the nscd
> at level 10 will give you a lot of info about the source of
> the problem.

I am not sure I will be able to do that: the machine involved is our main
server, and I am not the one making the decisions, but I will see what I
can do.

> If these are external web sites causing the problem,

I don't know what *causes* the problem, but I know I noticed it when
launching pine (/var/mail is local) or prstat. Besides the fact that we
use NIS+ (and this machine is the server), I don't know what other
information could be relevant.

Thank you for your answers.

Joe Seigh

May 4, 2005, 8:25:48 AM

On Solaris 2.6 I ended up downloading a copy of BIND 8 and using it,
since it had a reentrant interface without the static global res
object. If the new libresolv has a res_init (don't remember the exact
name) for thread-local copies of res and has functions which take it
as an argument, then it's reentrant. Otherwise use BIND 8. BIND 9 (or whatever
it is now) is reentrant also, but the API was totally changed and not
documented very well.

Note that BIND uses stdio to init res, so be careful if you are in a 32-bit
environment with a limit of 256 open FILE handles.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

ba...@smaalders.net

May 4, 2005, 12:41:24 PM

Now I'm confused. Does this happen with DNS or NIS+?

What does your nsswitch.conf file say?

What does pstack report while the process doing
the lookup is hanging (so we can determine which
name service is causing the problem)? Note that
NIS+ does have issues with queries originating on
the server machine if I remember correctly, so it's
important to figure out which name service is actually
at fault here.

- Bart

Marc

May 4, 2005, 1:22:02 PM

ba...@smaalders.net wrote:

> Now I'm confused. Does this happen with DNS or NIS+?

OK, I am the one who is confused (and confusing, sorry), actually.
It is probably NIS.

> What does your nsswitch.conf file say?

hosts: nis [NOTFOUND=return] files
(That is the line you were asking about, right?)
nsswitch.conf differs from nsswitch.nis only for passwd,
where it says: compat.
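
To read the hosts line above, here is a small Python sketch that parses one nsswitch.conf entry and lists which sources would be consulted for a given outcome. The function names and the outcome dictionaries are hypothetical. The defaults assumed here (SUCCESS returns, every other status continues) are the usual nsswitch.conf defaults.

```python
import re


def parse_nsswitch_line(line):
    """Split 'db: src [STATUS=action] src ...' into (db, [(source, criteria)])."""
    database, _, rest = line.partition(":")
    entries = []
    for token in re.findall(r"\[[^\]]*\]|\S+", rest):
        if token.startswith("["):
            # Bracketed criteria apply to the source just before them.
            if entries:
                for pair in token.strip("[]").split():
                    status, _, action = pair.partition("=")
                    entries[-1][1][status.upper()] = action.lower()
        else:
            entries.append((token, {}))
    return database.strip(), entries


def sources_tried(entries, outcomes):
    """List sources consulted in order, given a hypothetical status per source."""
    defaults = {"SUCCESS": "return"}     # any other status defaults to 'continue'
    tried = []
    for source, criteria in entries:
        tried.append(source)
        status = outcomes.get(source, "NOTFOUND")
        action = criteria.get(status, defaults.get(status, "continue"))
        if action == "return":
            break
    return tried


if __name__ == "__main__":
    db, entries = parse_nsswitch_line("hosts: nis [NOTFOUND=return] files")
    print(db, sources_tried(entries, {"nis": "NOTFOUND"}))
```

Under those assumptions, a NOTFOUND answer from NIS ends the lookup, and files is only reached when NIS reports something else, such as being unavailable.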

> What does pstack report while the process doing
> the lookup is hanging (so we can determine which
> name service is causing the problem)? Note that
> NIS+ does have issues with queries originating on
> the server machine if I remember correctly, so it's
> important to figure out which name service is actually
> at fault here.

I cannot try that now, as I explained, but I will try to.

Marc

May 5, 2005, 12:15:35 PM

ba...@smaalders.net wrote:

> What does pstack report while the process doing
> the lookup is hanging (so we can determine which
> name service is causing the problem)?

I don't know if this helps, but I trussed the programs I know used to
hang (pine and prstat at least, although most programs had issues at
startup), and there are not that many door_call calls. If the line:
enable-cache hosts no
in /etc/nscd.conf does not change the calls made to door_call (I don't
know where the caching takes place), the output might still be relevant.
(I will still try to follow your instructions when I get the chance.)
Let's look:

prstat:
900: getdents(4, 0x100424EA8, 8192) = 0
900: getloadavg(0xFFFFFFFF7FFFF2B4, 3) = 3
900: open("/var/run/name_service_door", O_RDONLY) = 549
900: fcntl(549, F_SETFD, 0x00000001) = 0
900: door_info(549, 0xFFFFFFFF7F4C3660) = 0
900: door_call(549, 0xFFFFFFFF7FFFECD8) = 0
900: door_info(549, 0xFFFFFFFF7FFFED08) = 0
900: door_call(549, 0xFFFFFFFF7FFFECD8) = 0
(last 2 lines repeated about 30 times, which I guess is because I have a
terminal with 34 lines)
And that is all. But we did not check where it was hanging.

pine:
According to the person who did the truss and found out it was a
door_call hanging (the output of that truss is lost), the door_call
should be one of those in the block where there are 3 in a row (not
sure though).

[...]
26362: time() = 1115303641
26362: getuid() = 10943 [10943]
26362: open64("/var/run/name_service_door", O_RDONLY) = 3
26362: fcntl(3, F_SETFD, 0x00000001) = 0
26362: door_info(3, 0xFF242748) = 0
26362: door_call(3, 0xFFBFEC80) = 0
26362: alarm(0) = 0
x4
26362: ioctl(0, TCGETA, 0xFFBFF1AC) = 0
26362: alarm(0) = 0
x5
26362: access("/users/00/maths/glisse/.pinercex", 0) Err#2 ENOENT
26362: stat("/etc/cram-md5.pwd", 0xFFBFF0B8) Err#2 ENOENT
26362: alarm(0) = 0
26362: open("/etc/c-client.cf", O_RDONLY) Err#2 ENOENT
26362: alarm(0) = 0
26362: door_info(3, 0xFFBFE810) = 0
26362: door_call(3, 0xFFBFE7F8) = 0
26362: door_info(3, 0xFFBFE810) = 0
26362: door_call(3, 0xFFBFE7F8) = 0
26362: door_info(3, 0xFFBFE810) = 0
26362: door_call(3, 0xFFBFE7F8) = 0
26362: uname(0xFFBFE350) = 1
26362: open("/etc/netconfig", O_RDONLY|O_LARGEFILE) = 4
[...]
26362: close(4) = 0
26362: door_info(3, 0xFFBFC688) = 0
26362: door_call(3, 0xFFBFC670) = 0
26362: open("/etc/nsswitch.conf", O_RDONLY|O_LARGEFILE) = 4
[...]
26362: close(4) = 0
26362: stat("/usr/lib/nss_nis.so.1", 0xFFBFBD50) = 0
[...]
26362: getpid() = 26362 [26361]
26362: open("/var/yp/binding/spi/cache_binding", O_RDONLY|O_LARGEFILE) = 4
etc
(I am wondering if my way of choosing which lines of truss output to
include makes me a good random number generator)
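
For longer traces, counting the calls by name is easier than reading them by eye. The following Python sketch assumes truss lines of the "pid: syscall(args) = result" shape shown above; the sample is a few lines copied from the pine trace, and syscall_counts is a made-up helper name.

```python
import re
from collections import Counter

# Matches lines such as '26362: door_call(3, 0xFFBFE7F8) = 0'
TRUSS_LINE = re.compile(r"^\s*(\d+):\s+(\w+)\(")


def syscall_counts(truss_text):
    """Count system calls by name in truss output; other lines are skipped."""
    counts = Counter()
    for line in truss_text.splitlines():
        match = TRUSS_LINE.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# A few lines taken from the pine trace above.
SAMPLE = """\
26362: door_info(3, 0xFFBFE810) = 0
26362: door_call(3, 0xFFBFE7F8) = 0
26362: door_info(3, 0xFFBFE810) = 0
26362: door_call(3, 0xFFBFE7F8) = 0
26362: uname(0xFFBFE350) = 1
[...]
"""

if __name__ == "__main__":
    for name, count in syscall_counts(SAMPLE).most_common():
        print(name, count)
```

A count only shows how often the door is used. It cannot show which call blocked, so it does not replace looking at the process while it hangs.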

ba...@smaalders.net

May 6, 2005, 2:43:45 AM

What you really want to do is to pstack the hung command.
I unplugged my ethernet cable and tried to ping sony.com;
the command hung immediately. I then pstacked it and got:

# pstack `pgrep ping`
801: ping sony.com
d0e6f068 door (3, 8044f48, 0, 0, 0, 3)
d0e0c4be _nsc_trydoorcall (8046f9c, 8046f98, 8046f94) + 1ba
d0f1bb23 _door_getipnodebyname_r (8047090, 809cf18, 809cf2c, 2120, 1a, 7) + 8f
d0f1e3ed _get_hostserv_inetnetdir_byname (809be80, 8047040, 8047038) + 6c1
d0f1c030 getipnodebyname (8047090, 1a, 7, 80471bc) + df
d0fa4569 get_addr (0, 80474e1, 80471f0, 80471f0, 0, 0) + 11d
d0fa41b6 _getaddrinfo (80474e1, 0, 80472b0, 80472e0, 0, 80472e4) + 3ca
d0fa442e getaddrinfo (80474e1, 0, 80472b0, 80472e0) + 16
08052c00 ???????? (80474e1, 0, 80472fc)
080527ba ???????? (804735c, 8047360)
08051ff5 main (2, 80473a0, 80473ac) + 539
08051a26 ???????? (2, 80474dc, 80474e1, 0, 80474ea, 8047524)

Which shows pretty clearly what the problem is from the client side.
This is generally easier than grokking the nscd's stacks; it has
lots o' threads if it's been up for a while on a busy system.
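
If there are many hung processes to check, the same reading of the stack can be automated. This Python sketch pulls function names out of pstack output shaped like the example above and flags stacks whose top frames go through _nsc_trydoorcall. Treating that one frame as the marker for "waiting on nscd" is an assumption based only on this stack, and the helper names are hypothetical.

```python
import re

# A frame looks like: 'd0e0c4be _nsc_trydoorcall (8046f9c, 8046f98, 8046f94) + 1ba'
FRAME = re.compile(r"^\s*[0-9a-f]+\s+(\S+)\s*\(")


def frame_names(pstack_text):
    """Return function names from the innermost to the outermost frame."""
    names = []
    for line in pstack_text.splitlines():
        match = FRAME.match(line)
        if match:
            names.append(match.group(1))
    return names


def waiting_on_nscd(names):
    """Heuristic (an assumption): a top frame is the nscd door client call."""
    return any(name == "_nsc_trydoorcall" for name in names[:3])


# Abbreviated copy of the stack shown above.
SAMPLE = """\
801: ping sony.com
d0e6f068 door (3, 8044f48, 0, 0, 0, 3)
d0e0c4be _nsc_trydoorcall (8046f9c, 8046f98, 8046f94) + 1ba
d0f1c030 getipnodebyname (8047090, 1a, 7, 80471bc) + df
08052c00 ???????? (80474e1, 0, 80472fc)
"""

if __name__ == "__main__":
    names = frame_names(SAMPLE)
    print(names, waiting_on_nscd(names))
```

The frames below the door call (getipnodebyname here) then say which kind of lookup the client was making.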

Casper H.S. Dik

May 6, 2005, 7:14:17 AM

ba...@smaalders.net writes:

But this just indicates that it is trying to resolve the name
through nscd, which it can't.

Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.

ba...@smaalders.net

May 7, 2005, 12:54:38 AM

But we weren't even sure which name lookup was hanging, so this is an
easy way to find out...

- Bart
