At this point I believe the tree is mostly healed, but I'm concerned there may be some lingering issues. Yesterday I had to restart one of our servers, and when it came back up the DHCP services failed to load. After some troubleshooting we determined that two of the subnet objects had become corrupted; manually deleting and recreating them solved the problem. What concerns me is that the service had been restarted since our eDir catastrophe, so I know the corruption was not caused at that time. Since the server had been running for about six weeks, it had to have happened more recently. I know it's not likely the two events are related, but since I'm not fully confident in the stability of the tree I want to make sure that eDir doesn't have any underlying issues I'm not aware of.
I followed TID 10060600 and performed an eDirectory Health Check. It looked mostly OK, but there were a few results that I don't know how to interpret. We are running eDir 8.7.3.7 (10552.79) on NetWare 6.5 SP4/5 servers. We have a single partition with 1 master and 3 R/W replicas.
Here are the results, and some questions:
1) NDS Version: All servers reported 10552.79
2) Time Sync: All servers in sync
3) Server to Server Sync: This screen was continuously scrolling, and I saw many "All Processed = Yes" results go by. I'm concerned because I also saw a large number of "Failed, replica in skulk (-698)" errors. On one server, 10 or more went by before the screen started moving again and we got to all processed.
4) Replica Synchronization: All servers reported all replicas in sync
with no errors.
5) External References: I ran this using the dsrepair -a option so I could also view obituaries. Since there is only one partition, there were no external references reported; however, there were a large number of obits. Server 1 (Master): 0 external references, 24 obits (12 unprocessed, 12 purgeable). Server 2 (R/W): 0 exref, 70 obits (40 unprocessed, 19 purgeable, 11 OK to purge). Server 3 (R/W): 0 exref, 47 obits (22 unprocessed, 14 purgeable, 11 OK to purge). Server 4 (R/W): 0 exref, 33 obits (20 unprocessed, 13 purgeable).
6) Replica States: All servers reported correctly
7) Schema Sync: All servers report all processed = yes
8) Repair Local Database: Have not run yet
Looking at the results above, I'm concerned about Step 3 and Step 5. The knowledge base said the -698 errors are generally cosmetic, but I got the idea that they should not happen as often as they did here. Any reason to be concerned? As for obits, I don't know much about them, but their sheer number seems to imply problems. Again, any reason to be concerned?
Concerning step 8, the TID says: "If while following the above outlined Health Check Procedures you encounter DS errors or if you suspect problems with a server's DS database, the Repair Local Database option within DSREPAIR is a valuable tool to check a server's DS database." Does this mean it's safe to run at any time? If not, how do I tell whether I need to run it? Do the results above suggest it's needed?
I do have a few Novell Incidents available. Would it be worthwhile to
spend one on their Directory team to do a full eDir physical? I want to
make sure that the tree is completely healthy and didn't know if they'd
be able to do a more complete check than I could. Are there any other
steps I can take to diagnose the situation further? Am I just being
paranoid?
Thanks in advance for all your help!
Travis
> 3) Server to Server Sync: This screen was continuously scrolling, I saw
> many All Processed = Yes go by. I'm concerned because I also saw a large
> number of "Failed, replica in skulk (-698)" errors. On one server 10 or
> more went by before the screen started moving again and we got to all
> processed.
Well, -698 errors are normal. The error basically tells you that the server you are talking to is currently busy synchronizing with another server. For example, Server A starts an outbound synch with Server B. While this synch is in progress, Server C tries to start an outbound synch with Server A. In this case Server A will reply to Server C with a -698 error, and Server C will then wait before retrying the synch with Server A.
The same would be true if Server C in the above example were to start an outbound synch with Server B; Server B would then reply with a -698 error.
So -698s are perfectly normal unless they are returned all the time. In that case they would show up in the Report Synchronization Status and would need some troubleshooting.
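If it helps to picture it, here is a rough sketch of that busy/retry behaviour in Python. It is purely illustrative -- not eDirectory's actual code, and the class names, retry counts and delays are made up -- but it is the pattern behind the -698s you saw:

import time

class Server:
    def __init__(self, name):
        self.name = name
        self.skulking_with = None     # set while a sync (skulk) is already in progress

    def request_sync(self, caller):
        if self.skulking_with is not None:
            return -698               # "replica in skulk" -- busy, come back later
        self.skulking_with = caller
        return 0                      # sync accepted

    def finish_sync(self):
        self.skulking_with = None

def sync_with_retry(source, target, retries=5, delay=2.0):
    """Caller backs off and retries while the target keeps answering -698."""
    for _ in range(retries):
        if target.request_sync(source) == 0:
            # ... updates would be exchanged here ...
            target.finish_sync()
            return True
        time.sleep(delay)             # wait a bit, then try the skulk again
    return False

A handful of -698s per pass is just this retry loop doing its job; only a server that never gets past the busy answer is worth chasing.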
>
> 5) External References: I ran this using the dsrepair -a option so I
> could also view obituaries. Since there is only one partition there
> were no external references reported however there were a large number
> of obits. Server 1 (Master): 0 External References, 24 obits (12
> unprocessed, 12 purgeable). Server 2 (RW): 0 exref, 70 obits (40
> unprocessed, 19 purgeable, 11 ok to purge). Server 3 (RW): 0 exref, 47
> obits (22 unpro, 14 purgeable, 11 ok to purge). Server 4 (RW): 0
> exref, 33 obits (20 unpro, 13 purgeable)
This sounds like a bit much. You need to take a closer look at this: run the check again at a later time and see whether these obits are still hanging around or whether they are progressing.
With eDir 8.7.3.x, obituary processing is usually very fast, especially in an environment with just 4 servers.
> Concerning step 8 the TID says: "If while following the above outlined
> Health Check Procedures you encounter DS errors or if you suspect
> problems with a server's DS database, the Repair Local Database option
> within DSREPAIR is a valuable tool to check a server's DS database."
> Does this mean it's safe to run at any time?
Nope, never just run a repair on a regular basis. Never run a repair
unless you have an error in your database that you have researched and
come to the conclusion that a repair is needed. Then only run the repair
with the switches needed to fix the problem.
>If not, how do I tell if I
> need to run it? Do the results above suggest it's needed?
At this time, no. The obits might require a repair, but let's check on that by doing another health check run to see whether the same obits are still there and clearly not progressing.
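If you want something concrete to compare, jot down the unprocessed counts per server from this run and check them against the next one. A trivial sketch follows; the server names and the second-run numbers are placeholders I made up (dsrepair doesn't export anything like this, so you'd type the counts in by hand from its report):

# Unprocessed obit counts, hand-copied from the dsrepair output of each run.
first_run  = {"SERVER1": 12, "SERVER2": 40, "SERVER3": 22, "SERVER4": 20}   # run 1 (your numbers)
second_run = {"SERVER1":  0, "SERVER2": 40, "SERVER3":  5, "SERVER4":  0}   # run 2 -- made-up, replace with real counts

for server, before in first_run.items():
    after = second_run.get(server, before)
    if before > 0 and after >= before:
        print(f"{server}: {before} -> {after} unprocessed obits -- stuck, worth a closer look")
    else:
        print(f"{server}: {before} -> {after} unprocessed obits -- draining, leave it alone")

If the counts drop between runs, the background processes are doing their job and the obits will clear on their own; if they sit still across a couple of runs, that is when to start looking at the stuck-obituary TIDs or a targeted repair.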
Whenever you do a health check that turns up errors and you're uncertain
what to do, just ask in here and we'll be happy to help you.
> I do have a few Novell Incidents available. Would it be worthwhile to
> spend one on their Directory team to do a full eDir physical? I want to
> make sure that the tree is completely healthy and didn't know if they'd
> be able to do a more complete check than I could. Are there any other
> steps I can take to diagnose the situation further? Am I just being
> paranoid?
Perhaps; at least you'd get another pair of eyes looking at the patient. At times another pair of eyes can be very useful, especially when it belongs to someone who is used to looking at eDirectory patients ;-)
Right now I wouldn't waste an incident on this one, though, until we've had a second look at the obituaries.
--
___________________________________________
Niclas Ekstedt, CNA/CNE/CNS/CLS
Network Consultant/NSC Sysop
InfraSystems Solutions AB
>3) Server to Server Sync: This screen was continuously scrolling, I saw
>many All Processed = Yes go by. I'm concerned because I also saw a
>large number of "Failed, replica in skulk (-698)" errors. On one server
>10 or more went by before the screen started moving again and we got to
>all processed.
Replica in Skulk means, essentially, that this server tried to sync to another server, and the other server replied with the equivalent of "I'm busy talking to another server right now, please come back later".
>5) External References: I ran this using the dsrepair -a option so I
>could also view obituaries. Since there is only one partition there
>were no external references reported however there were a large number
>of obits.
You should clean these up. There are TIDs on how to deal with stuck obits.
>8) Repair local Database: Have not run yet
Probably wouldn't hurt, but I wouldn't do it yet.
>Concerning step 8 the TID says: "If while following the above outlined
>Health Check Procedures you encounter DS errors or if you suspect
>problems with a server's DS database, the Repair Local Database option
>within DSREPAIR is a valuable tool to check a server's DS database."
>Does this mean it's safe to run at any time? If not, how do I tell if I
>need to run it? Do the results above suggest it's needed?
Generally it's safe to run, or at least it won't make things worse than they already are, but it is a low-level database repair utility and it is designed to try to find and fix anything it thinks is wrong with the database. It's not something to be afraid of, nor something to be treated lightly.
>I do have a few Novell Incidents available. Would it be worthwhile to
>spend one on their Directory team to do a full eDir physical?
I probably wouldn't, at least not for this, yet. Fix the stuck obits, then go
through and see how things are in the health check.
---------------------------------------------------------------------------
David Gersic dgersic_@_niu.edu
I'm tired of receiving rubbish in my mailbox, so the E-mail address is
munged to foil the junkmail bots. Humans will figure it out on their own.
Thanks,
Travis
Turned out to be the right decision to call Novell. When the directory was rebuilt, there were many object collisions, which resulted in objects in the tree with names like 0_5.OU.O. In the days following the rebuild we went through and deleted many of these objects. One of the objects reported itself as a server, and one of our replicas believed it was valid. That replica stopped processing obits because it kept trying to contact the 0_5 server before it would process them. We ended up having to remove the replica from that server (it wouldn't let us delete the 0_5 object via dsdump). We then had to manually remove the remaining obits from the other three replicas. Once this was done, a clean replica was added back to the server we had removed it from.
Thanks for all your help! I do appreciate it.
Travis
>Turned out to be the right decision to call Novell. When the directory
>was rebuilt there were many object collisions which resulted in objects
>in the tree with names like 0_5.OU.O. In the days following the rebuild
>we went through and deleted many of these objects. One of the objects
>reported itself as a server and one of our replicas believed it was
>valid.
Ugh. That's a pain.
>Thanks for all your help! I do appreciate it.
Glad to hear you got it sorted out. But you never did tell us how you got to
this state in the first place. Inquiring minds want to know.
Ugh. OK, but first a disclaimer: I had been on the job for less than two weeks when this happened.
It all started when we noticed a loud beeping coming from the server room. A drive in our RAID had died. The previous administrator got on the phone with the vendor to find out what happened, and they said they believed it was a RAID controller firmware issue. We announced that the server would be down, and the administrator followed the tech support guy's instructions and brought the RAID array back online. It seemed to be working but was still a bit funky. So, to test it, the support tech told the admin to fail one of the good drives (not the one that had failed before) to make sure it was working. Believe it or not, the admin did it, and not surprisingly the RAID was toast. This is where I came in, as the announced support window was approaching its end. They were concerned because this was our DHCP server. I said, "No problem, DHCP is stored in eDir, so I can just move the service to another server." I went back to my office and discovered that I couldn't log in, to anything. I tried every IP address of every NetWare server I knew about with no luck. I went back to the server room and asked where the other replicas were. They told me, I checked, and they weren't there. The only server with a copy of the root partition was the one they had just hosed. Apparently, because there were over 20 partitions (on a tree with no WAN links and about 14,000 objects, go figure), they lost track of the various replica rings. 7 partitions had their only copies on this server. A few servers had been retired over the past few years and the replicas went with them. We tried using the RAID utilities to rebuild, but the SYS volume wouldn't mount. I ran a pool rebuild and was told that 47 files were corrupted. The volume mounted, but eDir was hosed. The rest is history.....
> It all started when we noticed a loud beeping coming from the server
> room. A drive in our RAID had died. The previous administrator got
> on the phone with the vendor to find out what happened, and they said they
> believed it was a raid controller firmware issue. We announced the
> server would be down and the administrator followed the tech support
> guy's instructions and brought the raid array back online. It seemed
> to be working but was still a bit funky. So, to test it, the support
> tech told the admin to fail one of the good drives (not the one that
> had failed before) to make sure it was working. Believe it or not,
> the admin did it, and not surprisingly the RAID was toast.
Did they sack that tech?
> went with them. We tried using the RAID utilities to rebuild but the
> Sys volume wouldn't mount. I ran a pool rebuild and was told that 47
> files were corrupted. The volume mounted but eDir was hosed. The
> rest is history.....
And what happened to your backup?
--
Cheers,
Edward
Ouch. OK, now might be a good time to design a good disaster recovery plan and thoroughly test it. Believe me, your boss can't say no, not after this history.
Ours? No. Theirs? I have no idea.
> And what happend to your backup ?
Don't ask.
>It all started when we noticed a loud beeping coming from the server
>room. A drive in our RAID had died. The previous administrator got on
>the phone with the vendor to find out what happened, and they said they
>believed it was a raid controller firmware issue. We announced the
>server would be down and the administrator followed the tech support
>guy's instructions and brought the raid array back online. It seemed to
>be working but was still a bit funky. So, to test it, the support tech
>told the admin to fail one of the good drives (not the one that had
>failed before) to make sure it was working.
Ouch.
Let me guess, you were working with Dull support?
>with no luck. I went back to the server room and asked where the other
>replicas were. They told me, I checked, and they weren't there. The
>only server with a copy of the root partition was the one they had just
>hosed. Apparently, because there were over 20 partitions (on a tree
>with no WAN links and about 14,000 objects, go figure) they lost track of
>the various replica rings. 7 partitions had their only copies on this
>server.
Ouch. Well, now's a good time to take an inventory of what your partitions are and where they're stored. Good DR planning would be important, too, now that you've seen the results of _not_ managing eDirectory well.