Needless to say this is very disconcerting and will be a huge problem if
it happens often. How can I go about doing a postmortem to figure out
what happened, and how might I prevent it from happening again? Thanks.
-Jim Wagner
I would start by looking at /var/log/messages and the logfiles in
/var/opt/novell/log. Hopefully you will find something there that
helps.
I would also prepare a list of commands to run the next time, before
restarting the server, to collect some live information. Capture the
screen output of each of these commands and save it to a file prior to
restarting. That might help if it happens again.
ifconfig
df
mount
ps -ef
dmesg
ndsstat
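The list above could be wrapped in a small collector script so nothing is lost in the scramble before a restart. A minimal sketch, assuming a writable /tmp; commands that may be absent on a given box (ndsstat only exists with eDirectory installed) are recorded rather than aborting the run:

```shell
#!/bin/sh
# Collect pre-restart diagnostics into a timestamped directory.
OUT=/tmp/diag-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"

for CMD in ifconfig df mount "ps -ef" dmesg ndsstat; do
    # File name: spaces in the command become underscores (ps_-ef.txt).
    FILE="$OUT/$(echo "$CMD" | tr ' ' '_').txt"
    if command -v "${CMD%% *}" >/dev/null 2>&1; then
        # Capture stdout and stderr so permission errors are kept too.
        $CMD > "$FILE" 2>&1
    else
        echo "command not found: $CMD" > "$FILE"
    fi
done

echo "saved diagnostics to $OUT"
```

Run it once now, while the server is healthy, to have a baseline to diff against after the next incident.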
Rainer
--
brunold
------------------------------------------------------------------------
brunold's Profile: http://forums.novell.com/member.php?userid=562
View this thread: http://forums.novell.com/showthread.php?t=350918
Our symptom is that they are still there: run -nsscon- and then the
command "-space-" and you will see the pool and all of the volumes.
But they don't get shared via NCP: "-ncpcon volumes-" will show fewer
volumes than expected.
--
--------------------------------------------
::*Sean*:: [cpa|net+|linux+|cna|cne|cla|clp|nce-es]
--------------------------------------------
------------------------------------------------------------------------
SROSeaner's Profile: http://forums.novell.com/member.php?userid=45
Unfortunately we don't yet have any antivirus software on this server.
I'll keep this in mind, however, as I hope to get some in the near future.
-J.W.
Thanks very much for your reply. You gave me some good ideas. In fact,
I think I've found the source of the failure. I've included an excerpt
from /var/log/messages below. It looks like eDirectory (ndsd) died at
13:14:23. That could certainly account for the problem. Also below is
a piece from ncp2nss.log which apparently says that the NCP server was
unavailable beginning at the same time.
Would you agree that the failure of ndsd appears to be the source here?
There are apparently no patches available for it. Any ideas what could
cause this or how to prevent it from happening again? If it does happen
again, do you suppose it would work to simply restart ndsd?
> I would also prepare a list of commands to run the next time, before
> restarting the server, to collect some live information. Capture the
> screen output of each of these commands and save it to a file prior
> to restarting. That might help if it happens again.
>
Thanks, I will save the output of those commands now while the server is
running properly so that I have something to compare with if the problem
should happen again.
From /var/log/messages:
Nov 14 12:53:54 AV3 smbd[14753]: read_data: read failure for 4 bytes
to client 10.0.2.80. Error = No route to host
Nov 14 13:04:24 AV3 syslog-ng[4295]: STATS: dropped 2
Nov 14 13:14:23 AV3 kernel: ndsd[12856]: segfault at 0000000000003600
rip 00000000f7f4aced rsp 00000000e91684a0 error 6
Nov 14 13:14:23 AV3 /usr/sbin/namcd[8565]: monitorChangesInLDAP:
ldap_result: Can't contact LDAP server
Nov 14 13:14:27 AV3 /usr/sbin/namcd[8565]: ldap_initconn: LDAP bind
failed (error = [81]), trying to connect to alternative LDAP server
Nov 14 13:14:27 AV3 /usr/sbin/namcd[8565]: Unknown error returned
reading configuration parameter: alternative-ldap-server-list
Nov 14 13:14:31 AV3 /usr/sbin/namcd[8565]: ldap_initconn: LDAP bind
failed (error = [81]), trying to connect to alternative LDAP server
Nov 14 13:14:31 AV3 /usr/sbin/namcd[8565]: Unknown error returned
reading configuration parameter: alternative-ldap-server-list
Nov 14 13:14:44 AV3 smbd[13527]: [2008/11/14 13:14:44, 0]
lib/smbldap.c:smbldap_connect_system(982)
Nov 14 13:14:44 AV3 smbd[13527]: failed to bind to server
ldaps://10.0.0.3:636 with dn="cn=AV3-sambaProxy,o=AV" Error: Can't
contact LDAP server
Nov 14 13:14:44 AV3 smbd[13527]: (unknown)
From /var/opt/novell/log/ncp2nss.log:
[! 2008-11-14 13:14:23] IPCClient::Open connect failed rc=111
[! 2008-11-14 13:14:23] IPCServRequest open/send/received failed rc=111
[! 2008-11-14 13:14:23] ... ncp server ping FAILED rc=111
-J.W.
Yes, I think ndsd is the reason for these crashes.
I think there is another parameter you should monitor. Depending on the
number of user connections, LDAP requests and similar load, the default
thread count for eDirectory might be set to too low a value. Check your
current value by running "ndsconfig get" and look at
n4u.server.max-threads. The current statistics for those threads can be
queried using "ndstrace -c threads" and look like this:
# ndstrace -c threads
[1] Instance at /etc/opt/novell/edirectory/conf/nds.conf: LX-...
Thread Pool Information
Summary      : Spawned 107, Died 77
Pool Workers : Idle 9, Total 30, Peak 39
Ready Work   : Current 1, Peak 7, maxWait 936775 us
Sched delay  : Min 4 us, Max 1289182 us, Avg: 39 us
Waiting Work : Current 17, Peak 23
Monitor the Pool Workers Peak and Total values and see if they come
close to the configured maximum threads. Run this command a few times
during the day to check those values.
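To make the periodic check less error-prone, the peak value can be extracted and compared against the maximum automatically. A hypothetical sketch: the sample line is hard-coded from the output above, and on a live server you would pipe the real "ndstrace -c threads" output through the same sed expression instead:

```shell
#!/bin/sh
# Warn when the eDirectory worker-thread peak nears the configured cap.
# MAX_THREADS: take this from "ndsconfig get" (n4u.server.max-threads).
MAX_THREADS=128

# Sample line as printed by "ndstrace -c threads" (inlined for
# illustration; replace with the live command's output in production).
SAMPLE='Pool Workers : Idle 9, Total 30, Peak 39'

# Pull the number following "Peak " out of the Pool Workers line.
PEAK=$(echo "$SAMPLE" | sed -n 's/.*Peak \([0-9]*\).*/\1/p')

# Flag anything at or above 80% of the maximum as worth investigating.
if [ "$PEAK" -ge $((MAX_THREADS * 80 / 100)) ]; then
    echo "WARNING: worker peak $PEAK is close to max $MAX_THREADS"
else
    echo "OK: worker peak $PEAK of max $MAX_THREADS"
fi
```

Dropped into cron a few times a day, this gives the trend Rainer suggests watching without anyone having to eyeball the numbers.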
Another option would be to send the eDirectory core file to Novell and
let them analyse it to see what caused the core. This can be done by
opening a service request ...
Rainer
Right now the peak pool workers value is 64, while max-threads is 128,
so things seem to be within a comfortable margin right now. Is the
"peak" value the peak since the server was last booted?
-J.W.
We have been suffering from these crashes as well. Today -- many.
When we have a crash, if you examine the server:
* The NSS volumes are still mounted and viewable through the file manager.
* You cannot connect to the server via Apache from a local workstation
running a browser.
* The small diagnostic stack does not respond.
* All traffic halts on the card, according to ifconfig.
* ifdown followed by ifup does not restart the network interface.
* ifconfig shows the network interface as being available.
We are using a Marvell interface (PCI) on an Asus A8R32-MVP board.
I am convinced that this is a network card problem -- not even an
ncp2nss issue.
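One way to test the network-card theory is to snapshot the interface counters when the hang hits. A sketch assuming a Linux /sys layout (present on the 2.6 kernels OES2 ships): take two snapshots a few seconds apart, and if operstate stays "up" while the rx/tx counters stop moving, the card or driver is a likelier suspect than ncp2nss.

```shell
#!/bin/sh
# Snapshot operational state and packet counters for every interface.
SNAPSHOT=""
for IF in /sys/class/net/*; do
    NAME=$(basename "$IF")
    STATE=$(cat "$IF/operstate" 2>/dev/null)
    RX=$(cat "$IF/statistics/rx_packets" 2>/dev/null)
    TX=$(cat "$IF/statistics/tx_packets" 2>/dev/null)
    SNAPSHOT="$SNAPSHOT$NAME state=$STATE rx=$RX tx=$TX
"
done
printf '%s' "$SNAPSHOT"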
Is there a solution?
Best regards
hgittler
--
hgittler
------------------------------------------------------------------------
hgittler's Profile: http://forums.novell.com/member.php?userid=6270
Here is the relevant part of my log.
Maybe someone has an idea they can pass on.
[! 2008-11-21 10:20:25] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 10:20:25] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 10:20:25] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 10:20:25] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 10:20:25] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 10:20:25] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 10:20:25] IPCListenerThread: system stack size = 8388608
[! 2008-11-21 10:34:18] IPCListenerThread: accept rc=9
[! 2008-11-21 10:34:18] IPCClient::Open connect failed rc=2
[! 2008-11-21 10:34:18] IPCServRequest open/send/received failed rc=2
[i 2008-11-21 10:34:18] ... ncp2nss daemon halted (15)
[! 2008-11-21 10:37:07] Check the NSS management path "/_admin/Manage_NSS"
[! 2008-11-21 10:37:08] The NSS management path "/_admin/Manage_NSS" is OK
[! 2008-11-21 10:37:08] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 10:37:08] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 10:37:08] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 10:37:08] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 10:37:08] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 10:37:08] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 10:37:08] IPCListenerThread: system stack size = 8388608
[! 2008-11-21 12:06:44] IPCListenerThread: accept rc=9
[! 2008-11-21 12:06:44] IPCClient::Open connect failed rc=2
[! 2008-11-21 12:06:44] IPCServRequest open/send/received failed rc=2
[i 2008-11-21 12:06:44] ... ncp2nss daemon halted (15)
[! 2008-11-21 12:09:33] Check the NSS management path "/_admin/Manage_NSS"
[! 2008-11-21 12:09:34] The NSS management path "/_admin/Manage_NSS" is OK
[! 2008-11-21 12:09:34] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 12:09:34] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 12:09:34] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 12:09:34] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 12:09:34] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 12:09:34] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 12:09:34] IPCListenerThread: system stack size = 8388608
[! 2008-11-21 12:18:36] IPCListenerThread: accept rc=9
[! 2008-11-21 12:18:36] IPCClient::Open connect failed rc=2
[! 2008-11-21 12:18:36] IPCServRequest open/send/received failed rc=2
[i 2008-11-21 12:18:36] ... ncp2nss daemon halted (15)
[! 2008-11-21 12:22:05] Check the NSS management path "/_admin/Manage_NSS"
[! 2008-11-21 12:22:07] The NSS management path "/_admin/Manage_NSS" is OK
[! 2008-11-21 12:22:07] verify "/_admin/Manage_NSS/manage.cmd"
[! 2008-11-21 12:22:07] "/_admin/Manage_NSS/manage.cmd" is OK
[! 2008-11-21 12:22:07] verify "/_admin/Manage_NSS/files.cmd"
[! 2008-11-21 12:22:07] "/_admin/Manage_NSS/files.cmd" is OK
[! 2008-11-21 12:22:07] verify "/_admin/Manage_NSS/linux.cmd"
[! 2008-11-21 12:22:07] "/_admin/Manage_NSS/linux.cmd" is OK
[! 2008-11-21 12:22:07] IPCListenerThread: system stack size = 8388608
No service contract.
It looks like this is a bug though.
Which caveats are those for NSS and LinuxShield? I didn't see anything
in the documentation about anything like that (other than you had to
have the LUM enabled users in eDir and set the file rights properly so
that it could scan the NSS volumes).
--
kjhurni
------------------------------------------------------------------------
kjhurni's Profile: http://forums.novell.com/member.php?userid=734
Yes, I learned a little secret after that last roundabout (with your
help, I might add).
We have Gold Support with McAfee. Apparently if you use the support
portal, you get India tech support (they have no clue, and I had a
similar experience to yours where the case sits in limbo for weeks and
then they want to know if you're still having the problem).
I ended up escalating through our sales partner, and that's when I
found out that if we want "good" support we have to call the 800#.
Eeeesh.
After that, it took about 2 weeks and they finally acknowledged a bug
in their documentation and supposedly fixed it (although I admit I was
bad and haven't confirmed that it truly did get fixed).
Interesting.
Is that in the LinuxShield docs? I didn't see that in the 1.5.1 docs.
If it's supposed to be in there, let me know so I can bug McAfee to
correct the documentation. (for example, McAfee has documented that you
need to exclude certain EPO/CMA directories from VirusScan on Windows).
I'm noticing this too on my OES2 server in the test lab. However,
that's a VMware ESX machine and it's not patched at all.
On my production machine on "real" hardware I've not seen this at all.
However, that's a 4 GB RAM server; the ESX machine is only getting 1.5
GB of RAM (and nobody ever uses it).
The production machine is fully patched as well.
This is going to sound stupid, but is your OES2 machine fully patched?
(SLES 10 SP1 patches and OES2 patches?) I don't know if that's the
issue or not, but I know I had issues with NSS on my production box
until I patched things.
'Cluster volume resources do not fail over when running McAfee
LinuxShield' (http://tinyurl.com/5gxwy5)
Regarding some McAfee support cases ... I opened one to tell them their
RPM post-installation scripts were invalid and prevented an automatic
installation using something like ZENworks Linux Management. The
problem was not actually related to ZLM; it was related to the
unattended response file that may be used for that type of
installation. After about 4 weeks of nothing on that support case (no
email and no phone call during that time), a technician asked if the
problem still exists and said they might have a look at it in a future
version. Thanks for that. We extracted the McAfee RPMs and rebuilt
them on our own to fix it.
Rainer
I just came up with the following but don't know if it will work.
Assuming ndsd is the process that is claiming all the memory: do your
servers have a local replica? And does it make a difference if you
move that replica to another server?
Michael
--
mwilmsen
------------------------------------------------------------------------
mwilmsen's Profile: http://forums.novell.com/member.php?userid=5116
Just for the record: we use an IBM server. I don't know if that matters.
> Is there any possibility to clean the ndsd process memory without
> restart the server?
You could always restart ndsd, though to clients that's pretty much the
same thing as restarting the server. It's just a lot quicker.
Are you on OES2 or OES2 SP1?
--
Joe Marton
Novell Support Forum SysOp
See what GroupWise 8 can do for you.
http://www.novell.com/products/groupwise/
> But now I have a problem with the named service because I have a DNS
> NetWare server with a _ in its name (SERVER_VM02). The named service
> will not start until I stop the service manually, delete all files in
> /etc/opt/novell/named/, and start the named service again.
In iManager, edit this zone and disable the check-names parameter.
> Sorry but there is no "check-names" parameter in the zone which I can
> disable with iManager. And if I add the "dnipAdditionaloptions"
> "check-names master ignore;"
Leave out the word "master".
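For background: in BIND's configuration syntax (which novell-named uses), check-names takes a class ("master", "slave" or "response") plus a severity at the options level, but only a severity inside a zone statement, which is why "master" must be dropped there. A hypothetical zone excerpt (the zone name is illustrative):

```
zone "example.com" in {
    type master;
    check-names ignore;   # zone-level form: severity only, no class
};
```

With the severity set to ignore, names containing underscores such as SERVER_VM02 no longer fail name checking when the zone loads.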
http://www.novell.com/support/viewContent.do?externalId=3775731