Re: Painfully slow logins...eDir problem?

a...@novell.com

Dec 2, 2009, 3:08:08 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'd start by getting a LAN trace of the login taking forever to see where
the delays are. Does a login via something like ndslogin or dsrepair work
on the servers with replicas themselves? It sounds like it may just be a
client issue or a combination of services overwhelming the systems.

Good luck.

ryamry wrote:
> Hey everyone. We are having ridiculously slow logins here. They don't
> happen all the time and are somewhat intermittent. Sometimes it takes
> upwards of 20 minutes to log in. I have found a few workstations that
> get the 897c and error 15(8819) errors during the login process and when
> mapping drives. Also, on the slow logins the NAL window takes a very long
> time (5-15 minutes) to load and show the icons.
>
> I have this posted in the ZENworks forum as well, here
> (http://forums.novell.com/novell-product-support-forums/zenworks/desktop-management/zw-desktop-management-7x/zdm7-install-setup/394462-logins-painfully-slow-post1897869.html#poststop).
> I'm posting this here based on what Massimo has found in packet traces.
> From his findings he says:
>
>> In that case, if a thorough dsrepair doesn't find anything, something
>> must be quite seriously wrong with that server. It may be useful in
>> that case if you'd post this in the matching eDirectory group for
>> further tips on how to possibly debug this. At any rate, the issue is
>> clearly visible: the client asks the server for that group, and the
>> server plain and simple refuses to answer, until the client eventually
>> times out, resets the whole eDir connection to that server, and then
>> retries (with a different server).
>>
>> and
>>
>> Oh, and another thing: After the client has eventually timed out
>> waiting for the server 10.1.2.22 to respond to its request, it resets
>> the whole connection to that server, then reconnects to 10.1.2.23, and
>> starts from scratch, this time successfully. So the core problem is
>> certainly 10.1.2.22, or that server's connection to the other servers.
>>
>> and
>>
>> In addition to these immediately visible problems, your overall
>> eDirectory performance (in terms of response times even to working
>> eDir requests) is truly atrocious, and not only from the server at
>> 10.1.2.22, but basically all eDir servers. Response times of often
>> well over one second for simple requests for, e.g., a single attribute
>> of a given object are far from normal, and especially with ZENworks
>> (which does hundreds of such requests), also add up to the overall
>> poor performance, where even "working" logins take several minutes
>> until they're complete.
>>
>
> I have packet traces that are available in the previous thread as well.
> I've run through a full eDir health check per TID 3564075 and there were
> no errors.
>
> I was hoping someone here could help me out finding the problem. Like
> I said logins are super slow and everybody is complaining about it...and
> rightfully so!
>
> I have 9 servers with 5 remote locations connected via 54Mbps wireless
> links. All buildings are having the slowness problems which started 2-3
> weeks ago. The three main servers here are SLES10 SP3 with OES2 SP2 that
> run on vSphere 4. The ZENworks server is v7 SP1 IR4a HP2. All the
> other (remote) servers are SLES10 SP2 OES2 SP1. Everything is patched up
> to date.
>
> I'd appreciate your help!
>
> -Rob
>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJLFsknAAoJEF+XTK08PnB5NRYQALpJqxg0UXClfKjK8b6C8/Uf
QSb4bsrOd/uoBaKEuYo5DOJhdWjPvljeOFQ8CQkTI7jsanH2FZ+onng5cOCbD/98
T77QwGaiNaOfVuGjGlV9/Wvbw8MvEkUGHgcwTMQi7bmZro7DA4HumOsY+RAE1lfR
/VNTUZVXCArA76fGc9+rtBqiYPQw6bhZtgkLPD9sn9bo3hldDaHfQkBnNXF0S1x7
eYOeThcbTl5X629OBmg7dwmjxp2ViJlGVOU0Wp17d6P5MBy+U5S1FFN5R/s8wxr7
gZBeU/jnoGANmI/s2fMqmilVT6zQstzSYqY0PEcLE78DcDnvOq0uhKUeDQMUZodY
MwJYKH51FS7195ahRELVxJHeXz0FdsTRY/NVmW1qKu9iLmCyroTJhIumNqTRCrto
q/Q9JcFB+MO1l83tRmWubeI+mOg8Q5kve16Gz8MgQbk3yOEVbg33fNvkyynuYaRj
ygbDF8CfZsNfVX3huGDBLdn3N3uDJUEsjZs7CfzNzPUHFNJpn+P+4xtlsirYC63O
0S2GvQ4CfnjUlWcMrtFOHbrjxlcAlEUhJ4Tw+uFeHXfQfjNWRVcZBa4CZsj4mUQ/
gSKXKIjdCtVeZ9H2aUHwyZgnpeTvBFyRCkkJC/XjPrF1oyCOko6xC8mKcur6qUbE
WEHmmmtGrL+3+Y0sfbMm
=wWu6
-----END PGP SIGNATURE-----

Massimo Rosen

Dec 3, 2009, 5:23:58 AM

Aaron,

"a...@novell.com" wrote:
>
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> I'd start by getting a LAN trace of the login taking forever to see where
> the delays are.

Umm, seriously, care to read the whole posting you're responding to?

"I'm posting this here based on what Massimo has found in packet traces."

His eDir server(s) occasionally refuse to answer simple requests like
reading a group object. I.e., the server acks the request at TCP/IP
level, but *never* returns the requested data. That leads to timeouts of
several minutes, until the client gives up and resets the connection.
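
To make that pattern easier to spot in a long capture, here is a minimal sketch that flags requests which never got a reply (or got one only after a long wait). It assumes the trace has already been exported into simple records; the `Packet` layout, the `request_id` correlation field, and the 30-second threshold are all made up for illustration and are not a real capture format.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Packet:
    """One simplified trace record (hypothetical layout, not a real capture format)."""
    time: float      # seconds since the start of the capture
    request_id: int  # assumed field that correlates a request with its reply
    kind: str        # "request" or "reply"


def find_unanswered(packets: List[Packet], capture_end: float,
                    timeout: float = 30.0) -> List[int]:
    """Return IDs of requests that saw no reply within `timeout` seconds."""
    sent: Dict[int, float] = {}
    answered: Dict[int, float] = {}
    for p in packets:
        if p.kind == "request":
            sent[p.request_id] = p.time
        elif p.kind == "reply":
            # Keep only the first reply seen for each request.
            answered.setdefault(p.request_id, p.time)

    stalled = []
    for rid, t_sent in sent.items():
        # A request with no reply is treated as waiting until the capture ended.
        t_reply = answered.get(rid, capture_end)
        if t_reply - t_sent >= timeout:
            stalled.append(rid)
    return sorted(stalled)


# Hypothetical data: request 2 is never answered, request 3 is answered late.
trace = [
    Packet(0.0, 1, "request"), Packet(0.2, 1, "reply"),
    Packet(1.0, 2, "request"),
    Packet(2.0, 3, "request"), Packet(50.0, 3, "reply"),
]
print(find_unanswered(trace, capture_end=300.0))  # [2, 3]
```

The TCP-level ack mentioned above would not count as a reply here; only the record carrying the requested data would.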

CU,
--
Massimo Rosen
Novell Product Support Forum Sysop
No emails please!
http://www.cfc-it.de

David Gersic

Dec 4, 2009, 1:14:23 PM

On Fri, 04 Dec 2009 16:46:02 +0000, ryamry wrote:

>> .wdlfas01.WDL.kasd   20501.00   2   Non-NetWare   Yes   0
>> .sunfas01.SUN.kasd   20219.15   2   Non-NetWare   Yes   0
>> .wesfas01.WES.kasd   20219.15   2   Non-NetWare   Yes   0
>> .KHSNW1.KHS.kasd     10553.93   2   Primary       Yes   0
>> .janfas01.JAN.kasd   20219.15   2   Non-NetWare   Yes   0
>> .admfas01.ADM.kasd   20501.00   2   Non-NetWare   Yes   0
>> .khsfas01.KHS.kasd   20501.00   0   Non-NetWare   Yes   0
>> .khsfas03.ZEN.kasd   20501.00   0   Non-NetWare   Yes   0
>> .khsfas02.KHS.kasd   20501.00   0   Non-NetWare   Yes   0

Unless you have another Primary and a Reference somewhere that this isn't
showing, that KHSNW1 server should be changed to a Single Reference.

I don't think that will change your symptoms any, though.

--
---------------------------------------------------------------------------
David Gersic dgersic_@_niu.edu
Novell Knowledge Partner http://forums.novell.com

Please post questions in the newsgroups. No support provided via email.

you...@hotmail.com

Dec 4, 2009, 1:46:08 PM

How many CPUs have you assigned to each of these VMs?

"ryamry" <rya...@no-mx.forums.novell.com> wrote in message
news:ryamry...@no-mx.forums.novell.com...
> Everything is perfect with that. The NIC settings are good too. It is
> a 1 Gb NIC, but these servers are in a VM cluster. There was a potential
> bottleneck there with the network traffic, but we took care of most of
> it and it should be better today when we do more work on it (putting
> another quad-port NIC in each VM host).

a...@novell.com

Dec 4, 2009, 2:06:45 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Give ndslogin a shot... it's trivial to use and, run on one or more
servers, should give some kind of idea how a login works without any other
pieces in the mix (login scripts, Zen, other client issues, networking
fun, etc.).

ndslogin admin.context.goes.here

Try it a few times on a few different servers and see what shows up.
Also, which process or processes (if any) are using up the CPU on these VM
boxes? How many VMs do you have on the VM hosts and what are they doing?
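
For comparing servers, a small timing wrapper can help. This is only a sketch: it times any command you hand it, and the placeholder command below just starts Python and exits so it runs anywhere. If the login command prompts for a password, the time you spend typing is included in the measurement, so read the numbers with that in mind.

```python
import statistics
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], runs: int = 5) -> List[float]:
    """Run `cmd` `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=False: a failed run is still a useful timing data point.
        subprocess.run(cmd, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: Sequence[float]) -> str:
    """One-line summary of a list of durations."""
    return (f"runs={len(durations)} min={min(durations):.2f}s "
            f"median={statistics.median(durations):.2f}s "
            f"max={max(durations):.2f}s")


if __name__ == "__main__":
    # Placeholder command; on a server console you would substitute the
    # real login command and its argument.
    print(summarize(time_command([sys.executable, "-c", "pass"], runs=3)))
```

Running the same wrapper on each replica server gives min/median/max figures that can be compared side by side.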

Good luck.

ryamry wrote:

> a...@novell.com;1898751 Wrote:
>> I'd start by getting a LAN trace of the login taking forever to see
>> where the delays are. Does a login via something like ndslogin or
>> dsrepair work on the servers with replicas themselves? It sounds like
>> it may just be a client issue or a combination of services
>> overwhelming the systems.
>>
>> Good luck.
>

> I have traces of the bad/failed logins. 'The Files are here.'
> (http://www.adrive.com/public/7c5acda8e55d6193bd5e6d377fd728e402c307567dbdff09c52ed664b1912fd5.html)
>
> In the "third row" capture, filter for ip address 10.3.3.206. This was
> one that took quite a long time for the NAL to show its icons, about 10
> minutes.
>
> Also, in the "22 and 24" capture file, the problem computer was
> 10.3.3.238. This one took a very long time after entering credentials
> for the login screen to go away (and start the login process).
> Eventually the computer got a pop-up message stating the connection was
> reset. The kid tried to log in again. It took a VERY long time and
> when the NAL finally showed up, most of the icons were missing. This kid
> lost the whole class period sitting and waiting for the computer to try
> and finish logging in.
>
> People can log in. It's just mostly super slow. Sometimes a few
> will log in in about a minute. This is also widespread in our district
> - happening everywhere.
>
> ndsrepair shows no errors. I have not tried to login on the server
> console via ndslogin.
>
> A quick read through the previous post, here
> (http://forums.novell.com/novell-product-support-forums/zenworks/desktop-management/zw-desktop-management-7x/zdm7-install-setup/394462-logins-painfully-slow-post1897869.html#poststop) might
> be of help to you as well.
>
> Anything you can suggest at this point would be helpful!

>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJLGV3EAAoJEF+XTK08PnB5qjgQANUbOGGYvE1xwKiSXeTl+lJA
4WJIgrxUbkxNxYK0euQcnL91O3w8tLDf8T06F9oW6vnPDfvr9dUVCBX14UertFrZ
HOn+ItvSuwKQL2jaTA6JTGESXvoypzzktVRLcxACoQWG8nVnXQ9tR+Zn/Zx+zDnh
TUOc3vMoW3cVzvu2SK9388LGjl3iFv98TGn+Y4gurRspp8X8AMNahxOnmHHivWAN
PGx+4RaEiUz8sTabsIz2Kv2plVzOGcpXECiOmSk6rTD/x/ga7DEypwSTBPNS/Spr
OJQONsRfjrGGigCBs+kQEDhATdpO1dyv7TPltdMSWc+Kp0T5Bs6lq0dl14wXhQ83
wLIgFfNGw4MAWvh7qxnHE4ZLB4IBVHsvmK5p8x5KuaY145byO4pNIPXx/mnrqea7
kLesUaEcKdRZZt521A/0cJFZB5GookHhcdXB3kFWXdCVhUTCk1/bv5AWsWmfke5G
sMgxYduD5sh9uQiVt9ILrAjtsg8FD7VDs4pdqyVBYQpmSHG1oCKX1i2MqRObmjrW
rcwB+HUxU2sxY0u00bcz8ZidZb5f6PJN2OJnxornrU3zaqV8QmtvTO96zL31m52Z
bjD6ymz3KhObHBAj5FgAg48zZ0Adi8dwLHmEV6v626i3h7g/vmU8FAQ/k8c6RM4S
TNxLLTijp9QZecdt4W6M
=bhZf
-----END PGP SIGNATURE-----

David Gersic

Dec 4, 2009, 3:14:28 PM

On Fri, 04 Dec 2009 18:26:01 +0000, ryamry wrote:

> NW1 is our only Netware server and will be removed from the tree as soon
> as NetStorage is moved to a different server.

Then it must be changed to a Single Reference (type=single). Your
timesync configuration is currently not valid.

a...@novell.com

Dec 4, 2009, 4:32:30 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

If you are taking seconds to complete a single bind from ndslogin then
something is definitely wrong either with the VM or eDirectory, though
that's probably a no-brainer. While doing this login turn on ndstrace and
see what you get from the following flags:

ndstrace
set dstrace=nodebug
ndstrace +time +tags +auth +nmas +ldap
dstrace file on
set dstrace=*r
<perform ndslogin test in another shell>
dstrace file off
quit

Attach the (by default) /var/opt/novell/eDirectory/log/ndstrace.log file
here. On a system doing nothing else this should take about eight lines
of AUTH-tagged data plus quite a few more of NMAS. If your system is busy
doing a million other things (as we'll hopefully see) we should see all of
those other things if they require authentication or LDAP at all.
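
To get a quick feel for how much AUTH, NMAS and LDAP traffic ends up in the log, a per-tag line count is usually enough. A minimal sketch follows; the regular expression assumes each traced line carries its tag as a short upper-case word followed by a colon, which is my assumption about the layout and may need adjusting to what your log actually looks like. The sample lines are placeholders, not real ndstrace output.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: a tag is a short upper-case word followed by a colon, e.g. "AUTH:".
TAG_RE = re.compile(r"\b([A-Z]{3,8}):")


def count_tags(lines: Iterable[str]) -> Counter:
    """Count lines per tag, using the first tag-like token on each line."""
    counts = Counter()
    for line in lines:
        match = TAG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Placeholder lines for illustration only.
sample = [
    "AUTH: placeholder message 1",
    "NMAS: placeholder message 2",
    "NMAS: placeholder message 3",
    "AUTH: placeholder message 4",
    "LDAP: placeholder message 5",
    "NMAS: placeholder message 6",
]
for tag, n in sorted(count_tags(sample).items()):
    print(tag, n)
```

Since `count_tags` accepts any iterable of lines, an open file handle on the ndstrace.log path above can be passed to it directly.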

When you stated that 'ndsd' is one of the top processes, what do you
mean in terms of utilization? 10%? 50%? 99%? Is it constantly at that level
or fluctuating up and down like owcimomd does?

Does the slowness on some of these boxes decrease if you turn off other VMs?

Good luck.

ryamry wrote:

> a...@novell.com;1900111 Wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Give ndslogin a shot... it's trivial to use and, run on one or more
>> servers, should give some kind of idea how a login works without any
>> other pieces in the mix (login scripts, Zen, other client issues,
>> networking fun, etc.).
>>
>> ndslogin admin.context.goes.here
>>
>> Try it a few times on a few different servers and see what shows up.
>> Also, which process or processes (if any) are using up the CPU on these
>> VM boxes? How many VMs do you have on the VM hosts and what are they
>> doing?
>>
>> Good luck.
>

> I ran ndslogin on the three main servers here:
>
> KHSFAS01: ~ 5 seconds to the password prompt, then *65 seconds* to
> complete.
> KHSFAS02 - master replica server: about the same to the password
> prompt, then 25 seconds to complete.
> KHSFAS03 - zen server: instant.
>
> The servers aren't under much load. top says about 1.0 or less. ndsd
> and owcimomd are the two that are usually on the top. More so ndsd than
> owcimomd.
>
> We have a total of 20 hosts on our VM cluster. Most of the other ones
> are Windows servers that run apps. Nothing too intensive.

>
>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBAgAGBQJLGX/tAAoJEF+XTK08PnB57vcQAIdhZKuqBDKwm2S8iIGcUjDD
nJHXgjs0uYpRY/xFzEWzevkyOYnckrmyHp2Q+VpUx/kqH9kzmjpS6PoaosKKgwXo
JA2rQrls3byeeF4UfJl2E9m06iCa4jzDocE/4CS/r7JvJuBQug3P5vWSwuOzMPNQ
cntUZrnC36DIx9LZSQS1NH3bLFRANhA9BTXFHYcxVu8WUDlRHfcA6g7vX11JLhOF
jBcLRRw9+r0m3oZWSgfpIlm7I5e0OsuOis/S5WFNnS3HDL3/GzZtHHWbiVGc8LhF
pOaRaMfJdhqb2fHYMd+ufJzGRHi/0n+K2zr5QpjhmezzAVeA/PTTCxMNPACM9BLg
I+ql+7UgvAPoA+bcZNAXpH1WE628uNGXwiCZq43006PN5J+J5K1w+d9qMGwFzuyM
4uBl2yooE37izJ6Mgl15q8E6oOzsi6n31EJZ5w58/w+oitrTTK7Msts7T/HymLPx
nB5kNRsgaS5k9sKp+VLSyrSO7NyK6d/LebkGqDfT6rFrHJF3uyRgdzMbHCX3/K+3
Lm21nYKuQ8n3S7KPdo+3RBzVcIj3T+0Dph5T22wTBdXj7lvj5b6WEAIKMjoxV5Go
9M9AiAyYgztpWbVSNirUDNKxEg1p/zCF6lHRThznOcD/Nz97jyxdwNVQEvMWkE5X
XsURmJ1CergdYuGMlOOV
=tUfn
-----END PGP SIGNATURE-----

Peter Kuo

Dec 4, 2009, 6:51:10 PM

It may be totally out in far left field, but have you by chance checked
for possible duplicate IP addresses?
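
One rough way to check: collect (IP, MAC) pairs from wherever you can get them (ARP tables, switch tables, a packet trace) and look for any IP that shows up with more than one MAC. A minimal sketch, with made-up MAC addresses from the documentation range:

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_duplicate_ips(observations: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each IP seen with more than one MAC address to that set of MACs."""
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so "AA:BB" and "aa:bb" count as the same MAC.
        macs_by_ip[ip].add(mac.lower())
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical observations; the MAC addresses are invented for illustration.
sample = [
    ("10.1.2.22", "00:00:5E:00:53:01"),
    ("10.1.2.23", "00:00:5e:00:53:10"),
    ("10.1.2.22", "00:00:5e:00:53:02"),
    ("10.1.2.23", "00:00:5E:00:53:10"),
]
print(find_duplicate_ips(sample))
```

A hit is a lead rather than proof: clustered or failover addresses can legitimately appear behind more than one MAC.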

--

Peter
eDirectory Rules!
http://www.DreamLAN.com

Massimo Rosen

Dec 7, 2009, 6:36:48 PM

Hi,

ryamry wrote:
>
> After that last msg we started troubleshooting more. Even though eDir
> said it was happy, when we started forcing some partition operations, 626
> errors appeared. I had a consultant working with me on this and we
> removed all the replicas on that server using the xk2 and xk3 switches.

That's like dropping a bomb on your gas station because you forgot to
fill up your car.

And *what* partition operations? Seriously, here and in the thread in
the Zen forum, you only supply information fragments. There is no "do
this, and your issue will go away" solution for your problem, it is by
far too complex for that. I've tried to extensively outline the possible
pitfalls over in the Zen forums, but I'll repeat it here again so that
the eDir specialists can catch up.

Your core issue is that eDir responds slowly, and sometimes not at all, to
rather simple eDir requests, which in turn leads to long login times
when Zen is involved, simply because Zen by nature produces *a lot* of
eDir requests. You also stated your login times are OK after hours,
which further proves you face a load issue.

That issue can be:

1. Plain and simple *local* overload of the server that is not responding,
or not responding in time. CPU, I/O, memory, as well as eDir *and Zen*
configuration can all be bottlenecks here. This needs to be verified, i.e.
you need to identify the real load on your eDir, and establish whether
that's a load it should be able to handle, or if it's way too much.

2. A network issue between the improperly responding server and another
server that it needs to contact for the requested data. In
your traces, the failing requests have all been for groups. I already
explained that if any part of the group, like for instance member
objects, does not exist on the local server (it doesn't hold a replica of
the group member), or vice versa, *and* your server has a problem
contacting whatever remote server *does* hold that data, your issue
can occur. A -626 error *can* point in that same direction, but
without knowing anything about what request that was, and whether
the -626 was possibly just yet another symptom of the same problem (and
not the cause), it is *extremely* difficult to help from here.
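
To put numbers on the "fine after hours" observation, one option is to bucket the response times from the traces by hour of day and compare. The sketch below assumes you have already extracted (hour, response time in seconds) pairs from a capture; the sample values are invented, and the one-second threshold is only a convenient cut-off.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple


def slow_share_by_hour(samples: Iterable[Tuple[int, float]],
                       threshold: float = 1.0) -> Dict[int, Tuple[float, float]]:
    """For each hour of day, return (median response time in seconds,
    share of responses slower than `threshold`)."""
    by_hour = defaultdict(list)
    for hour, seconds in samples:
        by_hour[hour].append(seconds)

    result = {}
    for hour, times in sorted(by_hour.items()):
        slow = sum(1 for t in times if t > threshold)
        result[hour] = (median(times), slow / len(times))
    return result


# Invented sample: slow around 09:00, fast around 18:00.
samples = [
    (9, 1.4), (9, 0.8), (9, 2.6),
    (18, 0.05), (18, 0.07), (18, 0.06),
]
for hour, (med, share) in slow_share_by_hour(samples).items():
    print(f"{hour:02d}:00 median={med:.2f}s slow={share:.0%}")
```

If the slow share tracks the school day closely, that points toward load (case 1); if it is just as bad at quiet times, the inter-server angle (case 2) deserves more attention.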

And quite frankly, your consultant apparently only knows the
sledgehammer approach. A -xk2 and -xk3 to "solve" a -626 is about the
worst case of cracking a nut with a sledgehammer I've heard in a long
time (possibly short of simply fdisking the whole server). I'm afraid
with that approach there'll never be a solution.

David Gersic

Dec 8, 2009, 1:14:23 AM

On Mon, 07 Dec 2009 22:36:02 +0000, ryamry wrote:

> ndsd and owcimomd usually go back and forth between 4 and 15% utilization,
> tops. Nothing to worry about in my eyes. Neither the servers nor the VM
> hosts are overloaded at all, looking at the performance charts.

I'd be interested to see if you're being bottlenecked at the disk, not at
the CPU. Go into iMonitor and click on the Agent Activity link. At the
top should be "DIB Writer Info". What are you seeing there?
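
As an OS-level cross-check (not a replacement for the iMonitor numbers), disk busy time on Linux can be read from /proc/diskstats: the tenth statistics field of each whole-disk line is the number of milliseconds the device has spent doing I/O. A minimal sketch that turns two readings into a utilization figure; the sample text is invented.

```python
from typing import Dict


def parse_io_ticks(diskstats_text: str) -> Dict[str, int]:
    """Return {device: milliseconds spent doing I/O} from /proc/diskstats text."""
    ticks = {}
    for line in diskstats_text.splitlines():
        parts = line.split()
        # Whole-disk lines have major, minor, name plus at least 11 counters;
        # shorter lines (e.g. partitions on older kernels) are skipped.
        if len(parts) >= 14:
            ticks[parts[2]] = int(parts[12])
    return ticks


def utilization(before: Dict[str, int], after: Dict[str, int],
                elapsed_ms: float) -> Dict[str, float]:
    """Fraction of the interval each device was busy (0.0 to 1.0)."""
    return {dev: (after[dev] - before[dev]) / elapsed_ms
            for dev in after if dev in before}


# Invented readings taken 1000 ms apart.
first = "   8       0 sda 100 0 800 50 200 0 1600 400 0 300 450"
second = "   8       0 sda 180 0 1400 90 350 0 2800 900 0 1200 1500"
print(utilization(parse_io_ticks(first), parse_io_ticks(second), 1000.0))
# {'sda': 0.9}
```

On a live system you would read /proc/diskstats twice with a sleep in between. A device sitting near 1.0 while the CPU idles would support the disk theory, though inside a VM the figure only reflects the virtual disk as the guest sees it.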

Massimo Rosen

Dec 8, 2009, 10:33:07 AM

Hi,

ryamry wrote:
>
> I'm trying to supply as much information as possible. I'm sorry if I'm
> leaving things out. Anyway, the partition ops that were being forced
> were just partition synchronization ops. He was running those and
> watching the traces for errors and that's when the -626 errors showed
> their faces, AFAIK.

Aha. Have you looked at TID 3911653, for instance? A -626 should raise
several red flags; in a properly set up and working tree, it should
basically never happen. But without knowing details about your tree, it's
difficult to tell what's up. It could for instance be a
name resolution issue, and/or an issue with the list of known eDir
servers in the eDir DB of the server showing the outbound -626 (e.g., it
attempted to connect to another server, but couldn't find a network
address for it; that would trigger a -626). That said, there's an
increasing probability that your issue comes from inter-server
communication problems, either because some links between servers are
overloaded, because name resolution fails, or because some servers are so
overloaded that they fail on external comms too.

Anyway, simply dropping a nuke on it, then rebuilding it and hoping
that would solve it is pretty pointless. *Even* if it helps temporarily for
whatever reason, the problem usually comes back (as you see), and you've
gained nothing, especially no knowledge of *why* the problem occurred and
how to prevent it in the future.

> CPU, Disk I/O, memory, network, etc all seem okay.

"Seem" really isn't good enough.

> It's gotta be
> something with eDir, as you mentioned before, as its affecting the whole
> district...not just one location or one server.

It doesn't need to be "something" with eDir, i.e. a bug or the like. It
can as well "just" be bad design. Even if I repeat myself, one of the
prime reasons for the behaviour you see is membership of anything
that can take members across partition boundaries. I've seen that
countless times, *especially* with Zen: applications being distributed by
groups, or by directly adding lots of individual users to them, and
these groups or apps containing members from all across the tree. And to
add insult to injury, the replicas of those partitions often only
existed across slow WAN links. That's a perfect recipe to kill eDir
performance when accessing these groups. But this is all speculation;
one would have to see your tree and the groups that fail, and in general get
*way* more information about everything going on at your site.
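
As a starting point for that kind of review, here is a minimal sketch that flags groups whose members live in more than one partition. It assumes you have exported two things from the tree by whatever means you prefer: group membership lists and a mapping from each object to the partition holding it. All object names below are invented, and treating KHS.kasd and WDL.kasd as partition roots is purely an assumption for the example.

```python
from typing import Dict, Iterable, Set


def cross_partition_groups(groups: Dict[str, Iterable[str]],
                           partition_of: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return {group: partitions} for groups whose members span several partitions."""
    result = {}
    for name, members in groups.items():
        # Members with no known partition are ignored rather than guessed at.
        spanned = {partition_of[m] for m in members if m in partition_of}
        if len(spanned) > 1:
            result[name] = spanned
    return result


# Hypothetical export: object -> partition that holds it.
partition_of = {
    "alice.KHS.kasd": "KHS.kasd",
    "bob.KHS.kasd": "KHS.kasd",
    "carol.WDL.kasd": "WDL.kasd",
}
# Hypothetical groups and their members.
groups = {
    "staff.KHS.kasd": ["alice.KHS.kasd", "bob.KHS.kasd"],
    "app-users.KHS.kasd": ["alice.KHS.kasd", "carol.WDL.kasd"],
}
print(cross_partition_groups(groups, partition_of))
```

Any group it reports is a candidate for the cross-partition lookups described above, and more so when the other partition's replicas sit only behind a slow link.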