NSS volumes suddenly lost

Jim Wagner

unread,

Nov 14, 2008, 3:03:36 PM11/14/08

to

We're using OES2 that according to YOU has all patches installed except
recent ones for Apache and MySQL. Today for no apparent reason all of
our users suddenly lost access to all NSS volumes. Because this server
runs all of our file and print services for hundreds of students and
teachers I didn't have much time to troubleshoot. I rebooted the
server, and after an automatic disk check everything returned to normal.

Needless to say this is very disconcerting and will be a huge problem if
it happens often. How can I go about doing a postmortem to figure out
what happened, and how might I prevent it from happening again? Thanks.

-Jim Wagner

brunold

unread,

Nov 16, 2008, 3:46:01 AM11/16/08

to

Jim,

I would start looking at the logfiles /var/log/messages and the
logfiles in /var/opt/novell/log. Hopefully you find there something in
that helps.

I also would prepare a list of commands I would run the next time
before restarting the server to collect some live information. Collect
from every of this commands the screen output and save them to a file
prior to restarting. That might help if it happens again.

ifconfig
df
mount
ps -ef
dmesg
ndsstat

Rainer

--
brunold
------------------------------------------------------------------------
brunold's Profile: http://forums.novell.com/member.php?userid=562
View this thread: http://forums.novell.com/showthread.php?t=350918

SROSeaner

unread,

Nov 17, 2008, 7:46:02 AM11/17/08

to

Do you run any antivirus software on that server? We run McAfee
LinuxShield, and it has extra caveats for NSS volumes on OES 2, and if
not followed exactly, volumes will disappear.

Our symptom is that they are still there: run -nsscon- and then the
command "-space-" and you will see the POOL and all volumes

But, they don't get shared via NCP: "-ncpcon volumes-" will show fewer
volumes than expected.

--
--------------------------------------------
::*Sean*:: [cpa|net+|linux+|cna|cne|cla|clp|nce-es]
--------------------------------------------
------------------------------------------------------------------------
SROSeaner's Profile: http://forums.novell.com/member.php?userid=45

Jim Wagner

unread,

Nov 17, 2008, 2:27:34 PM11/17/08

to

SROSeaner wrote:
> Do you run any antivirus software on that server? We run McAfee
> LinuxShield, and it has extra caveats for NSS volumes on OES 2, and if
> not followed exactly, volumes will disappear.
>
> Our symptom is that they are still there: run -nsscon- and then the
> command "-space-" and you will see the POOL and all volumes
>
> But, they don't get shared via NCP: "-ncpcon volumes-" will show fewer
> volumes than expected.
>

Unfortunately we don't yet have any antivirus software on this server.
I'll keep this in mind, however, as I hope to get some in the near future.

-J.W.

Jim Wagner

unread,

Nov 17, 2008, 2:26:14 PM11/17/08

to

brunold wrote:
> I would start looking at the logfiles /var/log/messages and the
> logfiles in /var/opt/novell/log. Hopefully you find there something in
> that helps.

Thanks very much for your reply. You gave me some good ideas. In fact,
I think I've found the source of the failure. I've included an excerpt
from /var/log/messages below. It looks like eDirectory (ndsd) died at
13:14:23. That could certainly account for the problem. Also below is
a piece from ncp2nss.log which apparently says that the NCP server was
unavailable beginning a the same time.

Would you agree that the failure of ndsd appears to be the source here?
There are apparently no patches available for it. Any ideas what
would cause this or how to prevent it again? If it should happen again,
do you suppose it would work to simply restart ndsd?

> I also would prepare a list of commands I would run the next time
> before restarting the server to collect some live information. Collect
> from every of this commands the screen output and save them to a file
> prior to restarting. That might help if it happens again.
>

Thanks, I will save the output of those commands now while the server is
running properly so that I have something to compare with if the problem
should happen again.

From /var/log/messages:
Nov 14 12:53:54 AV3 smbd[14753]: read_data: read failure for 4 bytes
to client 10.0.2.80. Error = No route to host
Nov 14 13:04:24 AV3 syslog-ng[4295]: STATS: dropped 2
Nov 14 13:14:23 AV3 kernel: ndsd[12856]: segfault at 0000000000003600
rip 00000000f7f4aced rsp 00000000e91684a0 error 6
Nov 14 13:14:23 AV3 /usr/sbin/namcd[8565]: monitorChangesInLDAP:
ldap_result: Can't contact LDAP server
Nov 14 13:14:27 AV3 /usr/sbin/namcd[8565]: ldap_initconn: LDAP bind
failed (error = [81]), trying to connect to alternative LDAP server
Nov 14 13:14:27 AV3 /usr/sbin/namcd[8565]: Unknown error returned
reading configuration parameter: alternative-ldap-server-list
Nov 14 13:14:31 AV3 /usr/sbin/namcd[8565]: ldap_initconn: LDAP bind
failed (error = [81]), trying to connect to alternative LDAP server
Nov 14 13:14:31 AV3 /usr/sbin/namcd[8565]: Unknown error returned
reading configuration parameter: alternative-ldap-server-list
Nov 14 13:14:44 AV3 smbd[13527]: [2008/11/14 13:14:44, 0]
lib/smbldap.c:smbldap_connect_system(982)
Nov 14 13:14:44 AV3 smbd[13527]: failed to bind to server
ldaps://10.0.0.3:636 with dn="cn=AV3-sambaProxy,o=AV" Error: Can't
contact LDAP server
Nov 14 13:14:44 AV3 smbd[13527]: (unknown)

From /var/opt/novell/log/ncp2nss.log:
[! 2008-11-14 13:14:23] IPCClient::Open connect failed rc=111
[! 2008-11-14 13:14:23] IPCServRequest open/send/received failed rc=111
[! 2008-11-14 13:14:23] ... ncp server ping FAILED rc=111

-J.W.

brunold

unread,

Nov 18, 2008, 10:16:02 AM11/18/08

to

Jim,

yes, I think ndsd is the reason for this crashes.

I think there is another parameter you should monitor, depending on the
amount of users connections, ldap requests and such things, the default
thread count for the edirectory might be set to a too low value. Check
your current value running "ndsconfig get" and check the value for
n4u.server.max-threads. The current statistics for those threads can be
queried using "ndstrace -c threads" and look like this one here:

# NDSTRACE -C THREADS

[1] INSTANCE AT /ETC/OPT/NOVELL/EDIRECTORY/CONF/NDS.CONF: LX-...
THREAD POOL INFORMATION
SUMMARY : SPAWNED 107, DIED 77
*POOL WORKERS : IDLE 9, TOTAL 30, PEAK 39
Ready Work : Current 1, Peak 7, maxWait 936775 us
Sched delay : Min 4 us, Max 1289182 us, Avg: 39 us
Waiting Work : Current 17, Peak 23*

Monitor the Pool Workers Peak and Total value and see if that comes
close to the max configured threads. Run this command a view times
during the day to check those values.

Also an option could be to send the edirectory core file to Novell to
let them analyse it and see what the reason was for that core. This can
be done when you open a service request ...

Rainer

--
brunold
------------------------------------------------------------------------
brunold's Profile: http://forums.novell.com/member.php?userid=562

Jim Wagner

unread,

Nov 20, 2008, 2:01:57 PM11/20/08

to

brunold wrote:
> I think there is another parameter you should monitor, depending on the
> amount of users connections, ldap requests and such things, the default
> thread count for the edirectory might be set to a too low value. Check
> your current value running "ndsconfig get" and check the value for
> n4u.server.max-threads. The current statistics for those threads can be
> queried using "ndstrace -c threads" and look like this one here:
>

> Monitor the Pool Workers Peak and Total value and see if that comes
> close to the max configured threads. Run this command a view times
> during the day to check those values.

Right now the peak pool workers value is 64, while max-threads is 128,
so things seem to be within a comfortable margin right now. Is the
"peak" value the peak since the server was last booted?

-J.W.

WillR

unread,

Nov 21, 2008, 2:34:02 PM11/21/08

to

We have been suffering from these crashes as well. Today -- many.

When we have the crashes -- if you examine the server --

* The NSS volumes are stilled mounted and viewable through file manager
* You cannot link to the server via apache on a local workstation
running a browser.
* the small diagnostic stack does not respond.
* All traffic halts on the card -- according to ifconfig.
* ifdown followed by ifup does not restart the net interface.
* ifconfig shows the net i/f as being available

We are using a marvel interface (pci) on an asus A8R32-MVP board.

I am convinced that this is a network card problem -- not even an
ncp2nss issue.

hgittler

unread,

Nov 24, 2008, 9:46:02 AM11/24/08

to

I think this error is a memory consumption problem
If you look in the Remote Manager in "Manage Linux" "View Process
Information", in the "ndsd" process the amount of "File Descriptors" is
grow
till all the memory is consumed.
Then in the ncp2nss.log the error
[! 2008-11-06 14:14:56] IPCClient::Open connect failed rc=111
[! 2008-11-06 14:14:56] IPCServRequest open/send/received failed
rc=111
[! 2008-11-06 14:14:56] ... ncp server ping FAILED rc=111
will appear and no user can access the NSS volumes any more.

Is there a solution?

Best regards
hgittler

--
hgittler
------------------------------------------------------------------------
hgittler's Profile: http://forums.novell.com/member.php?userid=6270

WillR

unread,

Nov 25, 2008, 12:12:49 PM11/25/08

to

hgittler wrote:
> I think this error is a memory consumption problem
> If you look in the Remote Manager in "Manage Linux" "View Process
> Information", in the "ndsd" process the amount of "File Descriptors" is
> grow
> till all the memory is consumed.
> Then in the ncp2nss.log the error
> [! 2008-11-06 14:14:56] IPCClient::Open connect failed rc=111
> [! 2008-11-06 14:14:56] IPCServRequest open/send/received failed
> rc=111
> [! 2008-11-06 14:14:56] ... ncp server ping FAILED rc=111
> will appear and no user can access the NSS volumes any more.
>
> Is there a solution?
>
> Best regards
> hgittler
>
>

Here is the stuff from my log.

Maybe someone has and ide they can pass on.

[! 2008-11-21 10:20:25] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 10:20:25] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 10:20:25] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 10:20:25] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 10:20:25] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 10:20:25] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 10:20:25] IPCListenerThread: system stack size = 8388608
[! 2008-11-21 10:34:18] IPCListenerThread: accept rc=9
[! 2008-11-21 10:34:18] IPCClient::Open connect failed rc=2
[! 2008-11-21 10:34:18] IPCServRequest open/send/received failed rc=2
[i 2008-11-21 10:34:18] ... ncp2nss daemon halted (15)

[! 2008-11-21 10:37:07] Check the NSS management path "/_admin/Manage_NSS"
[! 2008-11-21 10:37:08] The NSS management path "/_admin/Manage_NSS" is OK
[! 2008-11-21 10:37:08] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 10:37:08] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 10:37:08] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 10:37:08] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 10:37:08] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 10:37:08] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 10:37:08] IPCListenerThread: system stack size = 8388608
[! 2008-11-21 12:06:44] IPCListenerThread: accept rc=9
[! 2008-11-21 12:06:44] IPCClient::Open connect failed rc=2
[! 2008-11-21 12:06:44] IPCServRequest open/send/received failed rc=2
[i 2008-11-21 12:06:44] ... ncp2nss daemon halted (15)

[! 2008-11-21 12:09:33] Check the NSS management path "/_admin/Manage_NSS"
[! 2008-11-21 12:09:34] The NSS management path "/_admin/Manage_NSS" is OK
[! 2008-11-21 12:09:34] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 12:09:34] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 12:09:34] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 12:09:34] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 12:09:34] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 12:09:34] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 12:09:34] IPCListenerThread: system stack size = 8388608
[! 2008-11-21 12:18:36] IPCListenerThread: accept rc=9
[! 2008-11-21 12:18:36] IPCClient::Open connect failed rc=2
[! 2008-11-21 12:18:36] IPCServRequest open/send/received failed rc=2
[i 2008-11-21 12:18:36] ... ncp2nss daemon halted (15)

[! 2008-11-21 12:22:05] Check the NSS management path "/_admin/Manage_NSS"
[! 2008-11-21 12:22:07] The NSS management path "/_admin/Manage_NSS" is OK
[! 2008-11-21 12:22:07] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 12:22:07] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 12:22:07] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 12:22:07] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 12:22:07] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 12:22:07] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 12:22:07] IPCListenerThread: system stack size = 8388608

WillR

unread,

Nov 26, 2008, 3:57:16 PM11/26/08

to

brunold wrote:
> Anybody has a service contract with Novell and can open a service
> request on this ?
>
> Rainer
>
>

No service contract.

It looks like this is a bug though.

kjhurni

unread,

Nov 26, 2008, 2:36:01 PM11/26/08

to

SROSeaner;1681020 Wrote:
> Do you run any antivirus software on that server? We run McAfee
> LinuxShield, and it has extra caveats for NSS volumes on OES 2, and if
> not followed exactly, volumes will disappear.
>
> Our symptom is that they are still there: run -nsscon- and then the
> command "-space-" and you will see the POOL and all volumes
>
> But, they don't get shared via NCP: "-ncpcon volumes-" will show fewer
> volumes than expected.

Which caveats are those for NSS and LinuxShield? I didn't see anything
in the documentation about anything like that (other than you had to
have the LUM enabled users in eDir and set the file rights properly so
that it could scan the NSS volumes).

--
kjhurni
------------------------------------------------------------------------
kjhurni's Profile: http://forums.novell.com/member.php?userid=734

kjhurni

unread,

Nov 26, 2008, 3:16:03 PM11/26/08

to

Thanks for the info! Glad I found out now rather than later.

Yes, I learned a little secret after that last roundabout (with your
help, I might add).

We have Gold Support with McAfee. Apparently if you use the support
portal, you get India Tech Support (and they have no clue, and I had the
similar experience that you did where it sits in limbo for weeks and
then they want to know if you're still having the problem).

I ended up escalating through our sales partner and that's when I found
out that if we want "good" support to call the 800#.

Eeeesh.

After that, it took about 2 weeks and they finally acknowledged a bug
in their documentation and supposedly fixed it (although I admit I was
bad and haven't confirmed that it truly did get fixed).

kjhurni

unread,

Nov 26, 2008, 3:06:02 PM11/26/08

to

brunold;1688011 Wrote:
> kjhurni,
>
> one thing I know is to exclude the ._NETWARE directory from the nss
> volumes from scanning.
>
> Rainer

Interesting.

Is that in the LinuxShield docs? I didn't see that in the 1.5.1 docs.

If it's supposed to be in there, let me know so I can bug McAfee to
correct the documentation. (for example, McAfee has documented that you
need to exclude certain EPO/CMA directories from VirusScan on Windows).

kjhurni

unread,

Nov 26, 2008, 5:46:02 PM11/26/08

to

hgittler;1685920 Wrote:
> I think this error is a memory consumption problem
> If you look in the Remote Manager in "Manage Linux" "View Process
> Information", in the "ndsd" process the amount of "File Descriptors" is
> grow
> till all the memory is consumed.
> Then in the ncp2nss.log the error
> [! 2008-11-06 14:14:56] IPCClient::Open connect failed rc=111
> [! 2008-11-06 14:14:56] IPCServRequest open/send/received failed
> rc=111
> [! 2008-11-06 14:14:56] ... ncp server ping FAILED rc=111
> will appear and no user can access the NSS volumes any more.
>
> Is there a solution?
>
> Best regards
> hgittler

I'm noticing this too on my OES2 server in the test lab. However,
that's a VMWARE ESX machine and it's not patched at all.

My production machine on "real" hardware I've not seen this at all.
However, that's a 4 GB RAM server. The ESX machine is only getting 1.5
GB of RAM (and nobody ever uses it)

But the production machine is fully patched as well.

This is going to sound stupid, but is your OES2 machine fully patched?
(SLES 10 SP1 patches and OES2 patches?) I don't know if that's the
issue or not, but I know i had issues with NSS on my production box
until I patched things.

--
kjhurni
------------------------------------------------------------------------
kjhurni's Profile: http://forums.novell.com/member.php?userid=734

brunold

unread,

Nov 26, 2008, 3:16:03 PM11/26/08

to

This is something we learned with out nearly 200 oes 1 linux systems
running LinuxShield 1.4. There is also a tid about a cluster failover
problem when it is included:

'Cluster volume resources do not fail over when running McAfee
LinuxShield' (http://tinyurl.com/5gxwy5)

Regarding some McAfee support cases ... I opened one to tell them their
rpm post installation scripts were invalid and prevent a automatic
installation using something like zenworks linux management. Whereas the
problem were not related to zlm, they were related to the unattended
response file that might be used for that type of installation. After
about 4 weeks on nothing on that support case (no email and no phone
call during that time), a technician asked if the problem still exists
and they might have a look in a future version. Thanks for that. We
extracted the McAfee rpms and built them on our own to fix it.

Rainer

--
brunold
------------------------------------------------------------------------
brunold's Profile: http://forums.novell.com/member.php?userid=562

mwilmsen

unread,

Nov 28, 2008, 4:06:02 AM11/28/08

to

mwilmsen;1688939 Wrote:
> just for the record.. we use a ibm server. Don't know if that matters?

I just came up with the following but don't know if it wil work.

Assuming ndsd is the process who is claiming all the memory. Does your
servers
have a local replica? And does it make a difference if you remove that
replica to another
server?

Michael

--
mwilmsen
------------------------------------------------------------------------
mwilmsen's Profile: http://forums.novell.com/member.php?userid=5116

mwilmsen

unread,

Nov 28, 2008, 3:56:01 AM11/28/08

to

mwilmsen;1688396 Wrote:
> My machine has 6GB of memory, fully patches and it's also filling up.
> Even when I restart ndsd the memory does not get free.
>
> Does anybody know how I manually can free the memory. Maybe a cronjob
> can do the trick for the moment.
>
> But it is a really annoying bug!!!

just for the record.. we use a ibm server. Don't know if that matters?

--
mwilmsen
------------------------------------------------------------------------
mwilmsen's Profile: http://forums.novell.com/member.php?userid=5116

Joseph Marton

unread,

Dec 19, 2008, 10:53:16 AM12/19/08

to

On Fri, 19 Dec 2008 15:36:03 +0000, hgittler wrote:

> Is there any possibility to clean the ndsd process memory without
> restart the server?

You could always restart ndsd though to clients that's pretty much the
same thing as restarting the server. It's just a lot quicker.

Are you on OES2 or OES2 SP1?

--
Joe Marton
Novell Support Forum SysOp
See what GroupWise 8 can do for you.
http://www.novell.com/products/groupwise/

Joseph Marton

unread,

Feb 3, 2009, 10:26:53 AM2/3/09

to

On Tue, 03 Feb 2009 15:16:03 +0000, hgittler wrote:

> But now I have the problem with the named service because I have a DNS
> Netware Server with a _ in its name (SERVER_VM02) . The named service
> will not start till I stop the service manually. Delete all files in
> /etc/opt/novell/named/ and start the named service again.

In iManager, edit this zone and disable the check-names parameter.

Joseph Marton

unread,

Feb 9, 2009, 9:39:42 AM2/9/09

to

On Mon, 09 Feb 2009 10:36:02 +0000, hgittler wrote:

> Sorry but there is no "check-names" parameter in the zone which I can
> disable with iManager. And if I add the "dnipAdditionaloptions"
> "check-names master ignore;"

Leave out the world "master".

http://www.novell.com/support/viewContent.do?externalId=3775731