
Pool Deactivation Problem


Brad Johnson

Jun 1, 2004, 2:42:58 PM

This morning I had a pool deactivate because of an I/O error. The server has
been in service only a few weeks. It is a NW 6.5 server and was installed
using the sp1.1b overlay. The server has an Intel SE7501HG2 motherboard with
a 3.06 GHz Xeon processor, with Hyper-Threading turned off in BIOS. The hard disk
controller is an Intel SRCU42X RAID controller using the Mega4_XX.Ham
driver. There are 2 Seagate 73GB hard disks attached to the controller in a
mirrored configuration. I had an NSS snapshot on this pool active at the
time of the error and deactivation. I received the following console
messages prior to the pool deactivating:

6-01-2004 9:38:28 am: COMN-3.20-34 [nmID=60010]
NSS-2.70-5005: Volume SRV2/DATA user data write
(20204(zio.c[2027])) to block 15792330(file block 507)(ZID 21955) failed.

6-01-2004 9:39:34 am: COMN-3.20-34 [nmID=60010]
NSS-2.70-5005: Volume SRV2/DATA user data write
(20204(zio.c[2027])) to block 15788435(file block 0)(ZID 519420) failed.

6-01-2004 9:39:55 am: COMN-3.20-1092 [nmID=A0025]
NSS-3.00-5001: Pool SRV2/MAIN_POOL is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15408843(file block -15408843)(ZID 6) has compromised pool integrity.

6-01-2004 9:39:55 am: COMN-3.20-33 [nmID=A0023]
NSS-2.70-5004: Volume SRV2/DATA is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15408843(file block -15408843)(ZID 6) has compromised volume integrity.

6-01-2004 9:49:03 am: COMN-3.20-1092 [nmID=A0025]
NSS-3.00-5001: Pool SRV2/MAIN_POOL is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15120046(file block -15120046)(ZID 2178) has compromised pool integrity.

6-01-2004 9:51:31 am: COMN-3.20-1092 [nmID=A0025]
NSS-3.00-5001: Pool SRV2/MAIN_POOL is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15120046(file block -15120046)(ZID 2178) has compromised pool integrity.

6-01-2004 9:54:43 am: COMN-3.20-1092 [nmID=A0025]
NSS-3.00-5001: Pool SRV2/MAIN_POOL is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15120046(file block -15120046)(ZID 2178) has compromised pool integrity.

6-01-2004 9:56:04 am: COMN-3.20-1092 [nmID=A0025]
NSS-3.00-5001: Pool SRV2/MAIN_POOL is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15120046(file block -15120046)(ZID 2178) has compromised pool integrity.

6-01-2004 9:59:19 am: COMN-3.20-1092 [nmID=A0025]
NSS-3.00-5001: Pool SRV2/MAIN_POOL is being deactivated.
An I/O error (20204(zio.c[2005])) at block 15120046(file block -15120046)(ZID 2178) has compromised pool integrity.

The first time I ran the Pool Rebuild it did not complete without error and
put the pool into maintenance mode. The second time I ran Pool Rebuild it
completed successfully and I was able to remount the volumes.

Additional Notes: I used the Consolidation Utility to move the Data Volume
from Srv1 to Srv2. Srv1 is an older IBM Netfinity server with a RAID 5
configuration running NW 6.0 sp3. Prior to moving the Data volume to Srv2 I
was receiving an occasional user data write error on the Data volume on
Srv1, and a less frequent I/O error which would lead to a Pool Deactivation.
I assumed, perhaps wrongly, that this was a hardware-related issue. And,
because I was planning the move to the new server, Srv2, I put up with the
occasional error assuming I would not have such problems on the new server.

How do I determine the underlying cause of the "user data write" and I/O
errors which led to the pool deactivation? How do I prevent this from
happening again? This event happened in the middle of a very busy morning, and I
cannot afford for this to happen again. I also have additional volumes
which are scheduled to be consolidated to this server. I cannot proceed
until I have a handle on this pool deactivation issue.

Any help is appreciated.

Thanks

Brad Johnson

P.S. I also had a pool deactivation on one of my test servers. This was a NW
6.5 machine with an IDE hard disk. I posted a question about this server a
few days ago. I am obviously having some severe doubts about the
dependability of the NSS file system, but am not sure what to do about it. I
have a number of servers which I was planning to upgrade from NW6.0 to
NW6.5, but these pool deactivation issues are causing me to question whether
I should continue with the upgrades.


Portlock Software Support

Jun 1, 2004, 5:12:57 PM
Brad,

NSS is very stable and reliable. However, any filesystem
is only as reliable as the underlying devices that hold the
filesystem.

The problems that you are experiencing are write errors
to your devices. You mention that you are using mirroring;
is this software or hardware mirroring?

If you are using hardware mirroring, then why is the driver
(or controller) reporting errors to the OS????

ZID 21955 is either a file or a directory.
ZID 519420 is either a file or a directory.
ZID 6 is the Name Tree (very important structure).

The design of NSS is to immediately disable a pool
when there are I/O errors.

1) How old are the disk drives?
2) Can you break the mirror so that you can test each drive?
3) Double check the firmware revision on the board.
4) Double check the driver version.
5) Is the hardware Novell Tested and Approved?
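
To see at a glance which blocks and ZIDs keep failing, the console messages can be tallied with a small script. The following is only a sketch: the regular expression is inferred from the messages quoted in this thread and is not an official NSS log format, and the function name tally_io_errors is hypothetical. It assumes the console output has been saved as plain text.

```python
import re
from collections import Counter

# Pattern inferred only from the messages quoted in this thread
# (an assumption, not a documented NSS format). It matches e.g.
#   (20204(zio.c[2005])) at block 15120046(file block -15120046)(ZID 2178)
ERROR_RE = re.compile(
    r"\((?P<code>\d+)\(zio\.c\[(?P<line>\d+)\]\)\)"
    r"\s+(?:to|at)\s+block\s+(?P<block>\d+)"
    r"\s*\(file\s+block\s+(?P<file_block>-?\d+)\)"
    r"\s*\(ZID\s+(?P<zid>\d+)\)"
)


def tally_io_errors(console_text):
    """Count I/O error messages per (ZID, block) pair."""
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(console_text.split())
    counts = Counter()
    for match in ERROR_RE.finditer(flat):
        counts[(int(match["zid"]), int(match["block"]))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally.py < saved_console_log.txt
    for (zid, block), n in tally_io_errors(sys.stdin.read()).most_common():
        print(f"ZID {zid} block {block}: {n} message(s)")
```

A block that fails repeatedly, as block 15120046 does in the messages above, can then be compared against whatever the controller's own error log reports for the same time window.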

--

Portlock Software
Linux and NetWare storage management software
www.portlock.com
sup...@portlock.com

"Brad Johnson" <bradJ...@centurytel.net> wrote in message news:S64vc.2654$aF2...@prv-forum2.provo.novell.com...

Brad Johnson

Jun 2, 2004, 9:52:25 AM
> is this software or hardware mirroring?
Hardware mirroring.

> why is the driver (or controller) reporting errors to the OS????

Good question. How do I find the answer?

> 1) How old are the disk drives?

Brand new

> 2) Can you break the mirror so that you can test each drive?

According to the capabilities listed in the documentation I should be able
to. But the documentation for actually doing this procedure is very
sparse. Thus I have reservations about tackling this process. Are there
testing utilities provided by Novell, or would I have to purchase third
party testing utilities?

> 3) Double check the firmware revision on the board.

The firmware is the latest version I could get as of a month ago.

> 4) Double check the driver version.

I have found that Intel has recently posted a newer version of the
Mega4_XX.Ham driver. The readme for the update does not mention any issues
with NSS in the "issues fixed" section.

> 5) Is the hardware Novell Tested and Approved?

The Intel SRCU42X RAID controller is Novell YES Tested and Approved for NW
6.0 and 6.5. There were no Seagate drives listed when I performed a search
for YES Tested bulletins. I assume this means that Seagate does not
routinely submit their drives for testing. I thought the list of tested
drives was rather limited, so I assume most hard drives are not submitted
for Novell YES Testing.


> NSS is very stable and reliable.

I know that in general this is true. However, in my limited experience, NSS
has seemed more sensitive to hardware issues. I currently work with 12
Netware servers and 3 of them have recently had issues where NSS has
deactivated a pool. These servers span a range of ages and hardware
configurations. I like a lot of the features that NSS brings to the table,
but I have become very nervous about when the next NSS issue will send me
scrambling to get a server back up and running.

Thank you for your help,

Brad Johnson

Portlock Software Support

Jun 2, 2004, 5:49:29 PM
Brad,

The issue with the drivers reporting errors to the
OS is an issue that you should talk to LSI about.

In my opinion, when you have hardware RAID,
no errors should be reported to the OS I/O layer
unless the error is unrecoverable. Systems software
should of course be able to read controller error
logs so that appropriate preventative steps can
be taken.

LSI should be able to provide you with controller
diagnostic software and testing tools to help you
isolate problems. For example, is the controller
or MB overheating?

Do you have the servers and storage devices connected
to a properly functioning UPS? Power surges are
notorious for messing up storage.

Don't worry too much about Yes testing on disk drives.
The controller companies should be doing their own
quality tests and recommending supported drives.

Other operating system designs often do not think
about what happens when lower level I/O fails and how
this can corrupt a file system. For example, there is
no point in updating the directory table for a new file
if the file does not exist because of an I/O problem.
There are probably as many philosophies for error
handling as there are errors, but in my experience doing
thousands of data recoveries, NSS is very strong
compared to other file systems such as NTFS, EXTx,
etc.

--

Portlock Software
Linux and NetWare storage management software
www.portlock.com
sup...@portlock.com

"Brad Johnson" <bradJ...@centurytel.net> wrote in message news:tYkvc.3393$aF2....@prv-forum2.provo.novell.com...
