Re: Logins PAINFULLY Slow!

Massimo Rosen

Dec 1, 2009, 10:53:47 AM

Hi,

That whole description sounds like serious underlying network
connectivity issues.

However, the LAN traces *must* show something. Want to upload them
somewhere for us to have a look?

ryamry wrote:
>
> Our logins here have been ridiculously slow lately. It's not always
> 100%, but kinda intermittent. Sometimes it takes upwards of 15 minutes
> to show the icons in the NAL after the splash screen shows. On top of
> that, a few workstations (not always the same ones) will not map any of
> their drives, reporting errors such as "An unexpected error has
> occurred: 15 (8819)", 8804 errors, and a new one I got today was 897c
> when trying to map a drive.
>
> I'm more concerned with how long it's taking for the NAL to load though.
> Sometimes kids can't get logged in during class time, it's taking so long.
> Sometimes when they do finally get it loaded it's missing half the icons
> and in offline mode. Changing it to online and doing a refresh on it
> doesn't always bring all their icons up.
>
> AFAIK, the policies are all configured correctly. I've done packet
> traces and it doesn't look like tree walking is going on. SLP is
> working, DNS, etc. I'm not sure what else to check. Everybody is getting
> really worked up here by how badly stuff is running lately. Any help
> would be greatly appreciated.
>
> We are mostly SLES10 SP3 with OES2 SP2. ZDML 7 SP1 IR4 on the server.
>
> Thanks!!!
>
> -Rob
>
> --
> ryamry
> ------------------------------------------------------------------------
> ryamry's Profile: http://forums.novell.com/member.php?userid=154
> View this thread: http://forums.novell.com/showthread.php?t=394462

--
Massimo Rosen
Novell Product Support Forum Sysop
No emails please!
http://www.cfc-it.de

Massimo Rosen

Dec 1, 2009, 8:21:38 PM

Hi,

ryamry wrote:

>
> mrosen;1897893 Wrote:
> > Hi,
> >
> > That whole description sounds like serious underlying network
> > connectivity issues.
> >
> > However, the LAN traces *must* show something. Want to upload them
> > somewhere for us to have a look?
>

> Yes, that would be great. 'The Files are here.'
> (http://www.adrive.com/public/7c5acda8e55d6193bd5e6d377fd728e402c307567dbdff09c52ed664b1912fd5.html)
>
> In the "third row" capture, filter for ip address 10.3.3.206. This was
> one that took quite a long time for the NAL to show its icons.

Well, there's *a lot* of weird crap going over the wire, and plenty of
room for all sorts of optimization.

But on a quick skim, there's apparently a problem at least with an eDir
object called "GRPUSER-JRG-GradYr20" in your tree. The client attempts
to read it from 10.1.2.22, but never receives a response from the
server, so it eventually resets the connection. The most common cause of
this behaviour is that the server the client talks to doesn't have
a replica of that object, and that server in turn cannot talk
successfully to another server that has it.

So this really looks like a rather nasty eDir sync issue. How many
servers? Does 10.1.2.22 hold a replica of this object? If not, which
server has one?

CU,

Massimo Rosen

Dec 1, 2009, 8:26:07 PM

Hi.

Oh, and another thing: After the client has eventually timed out waiting
for the server 10.1.2.22 to respond to its request, it resets the whole
connection to that server, then reconnects to 10.1.2.23 and starts from
scratch, this time successfully. So the core problem is certainly
10.1.2.22, or that server's connection to the other servers.
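
For anyone who wants to look for this pattern in their own traces, here is a minimal Python sketch of the check described above: list the requests that never got a reply. The `Packet` record, its fields, the timeout, and the timestamps are all invented for illustration. A real trace would first have to be exported and parsed into this shape.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    time: float  # seconds since start of capture (made-up values below)
    src: str
    dst: str
    kind: str    # "request", "reply" or "reset"
    seq: int     # hypothetical id used to pair a request with its reply


def unanswered_requests(packets, timeout=5.0):
    """Return requests that got no reply from the server within `timeout` seconds."""
    replies = {}
    for p in packets:
        if p.kind == "reply":
            # Remember the earliest reply per (server, client, id).
            replies.setdefault((p.src, p.dst, p.seq), p.time)

    stalled = []
    for p in packets:
        if p.kind != "request":
            continue
        reply_time = replies.get((p.dst, p.src, p.seq))
        if reply_time is None or reply_time - p.time > timeout:
            stalled.append(p)
    return stalled


# Illustrative trace: the first server never answers, the second one does.
trace = [
    Packet(0.0, "10.3.3.206", "10.1.2.22", "request", 1),
    Packet(30.0, "10.3.3.206", "10.1.2.22", "reset", 0),
    Packet(30.5, "10.3.3.206", "10.1.2.23", "request", 1),
    Packet(30.6, "10.1.2.23", "10.3.3.206", "reply", 1),
]

for p in unanswered_requests(trace):
    print(f"no reply from {p.dst} to request {p.seq} sent at {p.time:.1f}s")
```

Grouping the result by `dst` would show quickly whether one server stands out.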

CU,

Massimo Rosen

Dec 2, 2009, 4:32:18 AM

Hi,

ryamry wrote:
>
> The "third row" was a trace of about 10 computers. The lab also has
> some software in there that allows the teacher to control the
> workstations. We also have some monitoring agents on the
> computers...maybe that is the excessive traffic you are seeing?

It's not excessive traffic, I wouldn't have dug so deep. But there's
a whole lot of *unsuccessful* connection attempts, and that's always a
problem. And I was strictly looking at the traffic of the one WS only.

> The server does have a replica of that object. 10.1.2.22 is our main
> file server, which would explain all our problems if that server is the
> root of it all. 10.1.2.23 is our ZENworks server, which also holds a
> replica of the JRG partition.

Ok. Do the two hold *exactly* the same replicas? As this appears to be a
group (right?), the core problem might not be with the group object
itself, but with one of its members, which in turn could be
in a remote replica.

> No errors show when running ndsrepair on any of our servers.

With what options? Not every issue is immediately visible. Have you
checked with iMonitor?

> We have
> had some timesync issues last week (after upgrading to SLES10 SP3 OES2
> SP2 and having the kernel parameters for timekeeping in VMware written
> over), but that has since been fixed.

I assume the time was running slow, or did it go into the future?

> After I read your last post I took the replicas off the 10.1.2.22
> server and put them back on; thinking that the timesync issues last week
> may have messed something up. What do you suggest to troubleshoot this
> out? I appreciate your help!!!

So removing and re-adding the replica didn't help?

Massimo Rosen

Dec 2, 2009, 8:21:02 AM

Hi,

ryamry wrote:
>
> Also, in the "22 and 24" capture file, the problem computer was
> 10.3.3.238. This one took a very long time after entering credentials
> for the login screen to go away (and start the login process).
> Eventually the computer got an error stating the connection was reset.

Basically the same problem in this trace. This time, the client is
asking 10.1.2.22 for the public key, and the server fails to answer at
all. There's something seriously wrong with that server, its eDir, or
its connectivity to the rest of the tree.

In addition to these immediately visible problems, your overall
eDirectory performance (in terms of response times even to working eDir
requests) is truly atrocious, and not only from the server at 10.1.2.22,
but from basically all eDir servers. Response times of often well over one
second for simple requests, e.g. for a single attribute of a given object,
are far from normal. Especially with ZENworks (which does hundreds of
such requests), they add up to the overall poor performance, where even
"working" logins take several minutes to complete.

CU,

Massimo Rosen

Dec 2, 2009, 9:07:15 AM

Hi,

ryamry wrote:
>
> As for the packets where .22 isn't responding to the query for the
> GRPUSR-JRG-GradYr object, do you have some packet numbers that I can
> look at so I know what to look for in the future?

Not at the moment, sorry. If you use Wireshark, you should be able to do
a text search for the object name though.
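
If the capture file is at hand outside Wireshark, a rough equivalent of that text search can be sketched in Python by scanning the raw bytes for the object name. This assumes the name travels unencrypted and is encoded either as ASCII or as UTF-16LE; which encoding the protocol actually uses is not something this sketch verifies. The results are byte offsets, not packet numbers.

```python
def find_name_offsets(data: bytes, name: str):
    """Return (encoding, byte_offset) pairs for every occurrence of `name`."""
    hits = []
    # Try both encodings, since the on-the-wire encoding is an assumption here.
    for encoding in ("ascii", "utf-16-le"):
        needle = name.encode(encoding)
        start = data.find(needle)
        while start != -1:
            hits.append((encoding, start))
            start = data.find(needle, start + 1)
    return hits


# Synthetic demo data standing in for the contents of a capture file.
sample = b"\x00\x01" + "GRPUSR-JRG-GradYr2014".encode("utf-16-le") + b"\xff"
print(find_name_offsets(sample, "GRPUSR-JRG-GradYr2014"))
```

For a real file, read it in binary mode and pass the bytes in. This only works on an uncompressed capture.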

> Am I right in
> assuming that all these unsuccessful packet attempts are likely related
> to this issue we are currently having?

No. There's an extremely high number of unsuccessful connection attempts
in the trace to other services, not Novell or eDir related at all. But
each of these attempts takes time.

> Yes, the .22 and .23 have -exactly- the same replicas - same types and
> everything. The GRPUSR-JRG-GradYr2014 object is a group. All members
> of that group are located on the same partition.

In that case, if a thorough dsrepair doesn't find anything, something
must be quite seriously wrong with that server. It may be useful in that
case to post this in the matching eDirectory group for further
tips on how to debug it. At any rate, the issue is clearly
visible: the client asks the server for that group, and the server plain
and simple refuses to answer, until the client eventually times out,
resets the whole eDir connection to that server, and then retries (with
a different server).

Massimo Rosen

Dec 4, 2009, 12:50:07 PM

Hi,

ryamry wrote:
>
> This still seems like it's a ZEN issue.

No. It's absolutely 100% not. Zen *may* be the trigger, because it puts
additional load on your eDirectory. But the issue is that your eDir
can't cope with that load, and it should have absolutely no problem
doing so.

>
> Login with everything as it was (agent ZEN7 SP1 IR3a) took about 5-6
> minutes to log in.

That's too long.

> With Workstation Manager service disabled and stopped it was about the
> same time until the icons appeared in the NAL.

That's still too long. *Way* too long. I don't accept anything longer
than 45 seconds max between hitting Enter in the client and full
availability of the desktop and NAL as viable. Of course, if you have a
whole boatload of applications, it could be more than that, but not
much.

> With the agent uninstalled it was about 20-30 seconds for the login
> process to complete.

Of course. Although, that's *still* too long. That depends on the
definition of "login process" though. I define it as the time from hitting
Enter on the GINA to the end of the login script. That shouldn't take more
than 10 seconds max. Producing the desktop is of course a Windows issue.

> I then installed the IR4a HP2 agent and tried. It took 23 minutes to
> log in.

Yes, because your eDir performance is atrocious, and logging in with or
without Zen makes a night-and-day difference in the number of eDir
requests. The mere login itself produces maybe 1% (if at all) of the eDir
requests a full login with Zen produces. Your environment can stand the
1%, but it can't cope with the 100%. That's why you only notice that
something's wrong when you enable Zen, because only in that case do you
really use eDir extensively.
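
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the requests are issued one after another, and the count of 300 requests per full login is invented for illustration (the thread only says "hundreds"). The latencies are likewise example values, with the slowest one in the "well over one second" range seen in the traces.

```python
def estimated_login_seconds(request_count, seconds_per_request):
    """Total wait if the client issues its requests strictly one after another."""
    return request_count * seconds_per_request


FULL_LOGIN_REQUESTS = 300                          # hypothetical "hundreds"
BARE_LOGIN_REQUESTS = FULL_LOGIN_REQUESTS // 100   # roughly the 1% case

for latency in (0.01, 0.1, 1.2):  # seconds per request, example values
    bare = estimated_login_seconds(BARE_LOGIN_REQUESTS, latency)
    full = estimated_login_seconds(FULL_LOGIN_REQUESTS, latency)
    print(f"{latency:.2f} s/request: bare login {bare:.1f} s, "
          f"full login {full / 60:.1f} min")
```

With per-request latency around a second, the bare login still finishes in seconds while the full login runs to minutes, which matches the pattern reported above.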

> I have packet traces of all of these if it would help.

I don't think they're necessary, really. I have already identified the
problem, and that is clearly and without doubt your eDirectory
performance, and in extreme cases, outright failure.

> We did find a potential network bottleneck with our VM cluster and have
> corrected most of it. .22 server is getting hammered on network traffic;
> it runs at about 60-70 Mbps in the morning hours, then decreases a few
> hours later to around 20-25. We are putting in another quad port NIC in
> each VM host tonight and making that bandwidth available to the guests.

Well, you may indeed simply be overloading your servers, that's a
possibility. But I doubt it's network traffic. If all or most of these
60-70 Mbps are indeed eDir traffic (and not files), then you're most
certainly overloading your eDir in terms of how many requests/s it can
handle. Increasing the bandwidth won't help, and may even make it worse.
You need to increase the performance of the backend. eDir is a database,
and that's usually bottlenecked by CPU, memory, configuration, and local
storage I/O performance, not by network traffic. By simply sending
enough complex searches to an eDir backend, it can easily be blown out
of the water with a single megabit of network traffic.
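
A small Python sketch of that last point, using purely hypothetical figures: the request size, the per-search cost, and the worker count below are invented and not measured from any eDirectory server. It only illustrates how a modest amount of traffic can carry more requests per second than a backend can process.

```python
def offered_requests_per_second(bandwidth_bits_per_s, request_bytes):
    """How many requests of a given size fit into the bandwidth each second."""
    return bandwidth_bits_per_s / (request_bytes * 8)


def backend_capacity_per_second(seconds_per_search, workers=1):
    """How many searches the backend can finish each second."""
    return workers / seconds_per_search


# Hypothetical figures: 1 Mbps of 250-byte search requests,
# each costing 50 ms of backend time, handled by 4 parallel workers.
offered = offered_requests_per_second(1_000_000, 250)
capacity = backend_capacity_per_second(0.05, workers=4)

print(f"offered: {offered:.0f} requests/s, capacity: {capacity:.0f} requests/s")
if offered > capacity:
    print("backend is the bottleneck; more bandwidth would not help")
```

Under these assumptions the network delivers several times more work than the backend can finish, so queues grow regardless of link speed.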

> But, everything is still slow. Any other suggestions would be
> helpful!

Seriously, you need to concentrate on your eDir performance. Either your
current setup/environment is simply not up to the task you put on it, or
something's not properly working, or not properly tuned. Again, it is
*not* Zen. Zen is merely showing the problem.

Massimo Rosen

Dec 4, 2009, 6:06:47 PM

Hi,

ryamry wrote:
>
> One other thing of note is that at 3 PM when school is out, login times
> (GINA to NAL icons showing) are about 60-90 seconds, consistently.

Which very much proves that it's a case of eDir overload. You really
need to work with the experts over in the eDir group to get help on how
to identify and solve the bottleneck, or possibly the configuration issue,
that causes your eDir to choke under daytime load. Your other symptoms
(C1 issues with objects disappearing or being dupes) are another direct
result of your issue. With eDir being overloaded on at least one server,
it will also not sync in a timely manner, and that causes these
symptoms.

Craig Wilson

Dec 5, 2009, 2:12:50 PM

One "ZENworks" related item is to check the 'NAL Refresh' setting in all of
your containers.
The value is stored in seconds.
The default value is 12 hours (43200).
Commonly it is set to a lower value such as 2-4 hours.

However, I have seen cases where someone mistakenly assumes the value
is in minutes and sets it to something such as 120.
This leads to a massive pounding of eDirectory, as all clients constantly hit
the servers with refreshes and never stop: a refresh takes more than 120
seconds to finish, so the next one starts the instant the previous one
completes.
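
As a quick sanity check for values read off the containers, something like the following Python sketch could help. The one-hour warning threshold is an arbitrary choice for this example and not a ZENworks rule; the function only does unit conversion and does not read anything from eDirectory.

```python
DEFAULT_REFRESH_SECONDS = 43200  # 12 hours, the default mentioned above


def describe_refresh(seconds, min_sane_seconds=3600):
    """Describe a refresh interval stored in seconds and flag very low values.

    min_sane_seconds is an arbitrary threshold picked for this sketch.
    """
    if seconds < min_sane_seconds:
        return (f"{seconds} s = {seconds / 60:.0f} min: suspiciously low, "
                f"was {seconds} meant as minutes ({seconds * 60} s)?")
    return f"{seconds} s = {seconds / 3600:g} h"


for value in (DEFAULT_REFRESH_SECONDS, 7200, 120):
    print(describe_refresh(value))
```

A value of 120 gets flagged, while the default and the common 2-4 hour settings pass.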

--
Craig Wilson - MCNE, MCSE, CCNA
Novell Knowledge Partner

Novell does not officially monitor these forums.

Suggestions/Opinions/Statements made by me are solely my own.
These thoughts may not be shared by either Novell or any rational human.
