Lost ZFS zpool, Powermt claims the devices are fine

Doug Freyburger

Jan 3, 2012, 6:30:14 PM
Folks,

One of the hosts I support lost its ZFS zpool and thus its Oracle
database, developer home directories and such. The host rebooted over
the holiday and the zpool did not come up.

The guy with the contact data for their Solaris support contract is on
vacation for two more days so I have time to try other sources to get
the developers working again until then so I am asking here. I have my
own Oracle metalink account so I was able to read the links in
/var/adm/messages but they say to restore from backups. Sure enough
the guy on vacation for two more days is also the one who has access
to the internal backup system at the remote site. Happy new year. ;^(

I tried "zpool export devstuff" to see if I could import it again -

#zpool status
no pools available
#zpool import
pool: devstuff
id: many-digits
state: UNAVAIL
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
devices and try again.
see: http://www.sun.com/msg/ZFS-8000-3C
config:

devstuff UNAVAIL insufficient replicas
emcpower0c UNAVAIL cannot open
emcpower1c UNAVAIL cannot open
emcpower2c UNAVAIL cannot open
emcpower3c UNAVAIL cannot open
#zpool import -f devstuff
cannot import 'devstuff': invalid vdev configuration
#powermt display dev=emcpower0c
Pseudo name=emcpower0a
CLARiiON ID=xxxx [SG_devstuff_0]
Logical device ID=600601602F10260012A02D3571EEDE11 [LUN 405]
state=alive; policy=BasicFailover; priority=0; queued-IOs=0
Owner: default=SP A, current=SP B
==============================================================================
---------------- Host --------------- - Stor - -- I/O Path - -- Stats ---
### HW Path I/O Paths Interf. Mode State Q-IOs Errors
==============================================================================
3072 pci@1f,2000/lpfc@1/fp@0,0 c1t5006016144600F0Cd0s0 SP A1 active alive 0 0
3072 pci@1f,2000/lpfc@1/fp@0,0 c1t5006016944600F0Cd0s0 SP B1 active alive 0 0

The other LUNs are also marked "active alive".

Any suggestions? I've seen a zpool go bad at boot time before, maybe a
timing problem with Powerpath or DMP. I could "zpool export devstuff"
and "zpool import devstuff" and it worked. But that was on a different
host in a different data center. This time it did not work.

Richard B. Gilbert

Jan 3, 2012, 8:36:10 PM
If this were my system, I would have tested and documented "How to
rebuild" and/or "how to restore from backup!"

The problem description strongly suggests that something has corrupted
your file and/or disk structure. This sort of thing can make the O/S
quite grumpy! Bosses too!!!!

Good luck! I'm afraid you will need it!!


Andrew Gabriel

Jan 4, 2012, 5:28:55 AM
In article <je0326$jg6$1...@dont-email.me>,
It looks like the LUNs vanished from the system, and still aren't visible.
Can you see the LUNs with fdisk or format? (Don't write to them).
If not, you need to fix the EMC setup and get them back.

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]

Sami Ketola

Jan 4, 2012, 5:53:53 AM
In comp.unix.solaris Doug Freyburger <dfre...@yahoo.com> wrote:
> config:
>
> devstuff UNAVAIL insufficient replicas
> emcpower0c UNAVAIL cannot open
> emcpower1c UNAVAIL cannot open
> emcpower2c UNAVAIL cannot open
> emcpower3c UNAVAIL cannot open

what does zdb -lv /dev/dsk/emcpower0c say? Can it find all 4 zfs labels?
and on the other device files?

can you find any valid labels on any devices in /dev/dsk?

> Any suggestions? I've seen a zpool go bad at boot time before, maybe a
> timing problem with Powerpath or DMP. I could "zpool export devstuff"
> and "zpool import devstuff" and it worked. But that was on a different
> host in a different data center. This time it did not work.

you can always try to rename /etc/zfs/zpool.cache, reboot and import.

Sami

Doug Freyburger

Jan 4, 2012, 10:54:38 AM
Richard B. Gilbert wrote:
> Doug Freyburger wrote:
>
>> One of the hosts I support lost its ZFS zpool and thus its Oracle
>> database, developer home directories and such. The host rebooted over
>> the holiday and the zpool did not come up.
>
> If this were my system, I would have tested and documented "How to
> rebuild" and/or "how to restore from backup!"

Sure, have that. But the guy with the backup-restore access in that
data center comes back from vacation tomorrow. If I can actually solve
the problem in the meantime there will be no need to restore from
backups. Restoring from backups equals giving up on any data that is
online.

> The problem description strongly suggests that something has corrupted
> your file and/or disk structure. This sort of thing can make the O/S
> quite grumpy! Bosses too!!!!

Right. If the data is corrupted I'll need to restore from backup. But
the partition tables are intact. What would trash the ZFS equivalent of
the inode tables at boot time without also trashing the partition
tables? I can think of a lot of stuff that would do that, none good.

Doug Freyburger

Jan 4, 2012, 10:56:48 AM
Andrew Gabriel wrote:
> Doug Freyburger <dfre...@yahoo.com> writes:
>
>> One of the hosts I support lost its ZFS zpool and thus its Oracle
>> database, developer home directories and such. The host rebooted over
>> the holiday and the zpool did not come up.
>
>> #powermt display dev=emcpower0c
>> Pseudo name=emcpower0a
>> CLARiiON ID=xxxx [SG_devstuff_0]
>> Logical device ID=600601602F10260012A02D3571EEDE11 [LUN 405]
>> state=alive; policy=BasicFailover; priority=0; queued-IOs=0
>> Owner: default=SP A, current=SP B
>> ==============================================================================
>> ---------------- Host --------------- - Stor - -- I/O Path - -- Stats ---
>> ### HW Path I/O Paths Interf. Mode State Q-IOs Errors
>> ==============================================================================
>> 3072 pci@1f,2000/lpfc@1/fp@0,0 c1t5006016144600F0Cd0s0 SP A1 active alive 0 0
>> 3072 pci@1f,2000/lpfc@1/fp@0,0 c1t5006016944600F0Cd0s0 SP B1 active alive 0 0
>>
>> The other LUNs are also marked "active alive".
>
> It looks like the LUNs vanished from the system, and still aren't visible.

The above clearly shows one of the LUNs visible and my comment says the
rest are as well.

> Can you see the LUNs with fdisk or format? (Don't write to them).

In format I ran a "read" analysis and let it run for about a minute per
device. Definitely accessible.

Thanks for the double check!

Doug Freyburger

Jan 4, 2012, 11:44:15 AM
Sami Ketola wrote:
> Doug Freyburger <dfre...@yahoo.com> wrote:
>> config:
>>
>> devstuff UNAVAIL insufficient replicas
>> emcpower0c UNAVAIL cannot open
>> emcpower1c UNAVAIL cannot open
>> emcpower2c UNAVAIL cannot open
>> emcpower3c UNAVAIL cannot open
>
> what does zdb -lv /dev/dsk/emcpower0c say? Can it find all 4 zfs labels?
> and on the other device files?

Thanks for this!

When I try any of the 4 PowerPath devices they all see the first 2
labels and they are correctly copies of each other with all the same
data. For the 3rd and 4th label they say:

--------------------------------------------
LABEL 2
--------------------------------------------
failed to read label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to read label 3

Now I'm off to FTFM (Find) and RTFM (Read) to figure out what it means
to have 2 of 4 labels available. It sounds bad - Whatever overwrote
some of the labels likely also overwrote some of the content data.
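
For anyone repeating this check across several devices, here is a minimal
sketch for tallying the failures. It only parses text in the shape shown
in this thread ("failed to read label N" or "failed to unpack label N").
The function name is made up for illustration, and the zdb invocation in
the comment is simply the one Sami suggested.

```shell
#!/bin/sh
# Hypothetical helper: count unreadable ZFS labels in `zdb -l` output
# supplied on stdin. Assumes failures are reported as
# "failed to read label N" or "failed to unpack label N".
count_bad_labels() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c -E '^failed to (read|unpack) label [0-3]' || true
}

# Intended use on the real host (not run here):
#   zdb -l /dev/dsk/emcpower0c | count_bad_labels
```

A result of 2 with labels 2 and 3 failing, as here, points at the end of
the device rather than the start.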

> can you find any valid labels on any devices in /dev/dsk?

Trying the devices with the WWN in their names that are listed by
"powermt display", half give a slightly different result:

--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

The redundant paths are supposed to be to the exact same devices and
"powermt config" confirms that's true. Creepy that a different path to
the same LUN produces different output.

Since they are on a Clariion one device goes through the primary path
the other through the trespass path so maybe it's not completely
creepy. I suspect that just told me that all 4 LUNs to this host are
owned by SP A1. This being a development node I'm not sure how bad that
is. On my list of stuff to check is ownership of the LUNs on all of the
production hosts. Good round robin assignment of SP ownership by LUN is
something I'll pay close attention to on production hosts not on
test/dev hosts.

Going through /dev/*s0 I see the two internal drives that are mirrored
with the metastat class of commands, the DVD-ROM that is unplugged,
six devices with WWNs in their names that are not currently mapped to
the host (that would be 3 former LUNs dual pathed) and the 8 paths to
the 4 PowerPath LUNs. That confirms there are no ZFS configured LUNs
visible to this host other than the ones in the lost zpool.

>> Any suggestions? I've seen a zpool go bad at boot time before, maybe a
>> timing problem with Powerpath or DMP. I could "zpool export devstuff"
>> and "zpool import devstuff" and it worked. But that was on a different
>> host in a different data center. This time it did not work.
>
> you can always try to rename /etc/zfs/zpool.cache reboot and import.

#last reboot | head -6
reboot system boot Tue Jan 3 17:37
reboot system down Tue Jan 3 17:33
reboot system boot Mon Jan 2 10:14
reboot system down Mon Jan 2 10:10
reboot system boot Sun Jan 1 16:29
reboot system down Sun Jan 1 16:25
#ls -laF /etc/zfs
total 14
drwxr-xr-x 2 root sys 512 Jan 2 10:04 ./
drwxr-xr-x 84 root sys 5632 Jan 3 17:40 ../
#

The modify time on the directory is 6 minutes before one of the reboots.
That makes sense - The reboot on the first is when the zpool was lost
but since this system is not production it tickets but does not page. I
saw the tickets on the second and tried "zpool export devstuff" and
"zpool import devstuff" because this host has lost its zpool before and
that recovered the zpool at that time. When I did the "zpool export
devstuff" it removed zpool.cache. Sigh. So I've done that and it did
not help. Thanks for the suggestion!

Had I ever lost a zpool on any other host I'd conclude that ZFS is not
ready for prime time. But since this host is the only one with the
problem I conclude this host is no longer even appropriate for a
test/dev environment. There's a replacement host on the way to replace
it but it won't arrive for a month. Guess it's time to expedite.

The reboot yesterday was for another theory. This host has lost its
zpool before while up and another time failed to bring them up at
reboot. I wondered if it was a race condition bringing up the PowerPath
devices too late in the boot process so they are not ready when ZFS is
started. If so then booting with an exported zpool would handle that.
It did not work.

Andrew Gabriel

Jan 4, 2012, 12:26:09 PM
In article <je1vkv$s35$1...@dont-email.me>,
Labels 0 and 1 are at the start of the LUN, whereas labels 2 and 3
are at the end of the LUN. Have the LUNs been truncated?
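
To make the layout concrete, a rough sketch of where the four labels sit.
It assumes the standard on-disk format: each label is 256 KiB, and the
device size is rounded down to a 256 KiB multiple before the trailing pair
is placed. The function name and the 1 GiB example size are made up.

```shell
#!/bin/bash
# Sketch: byte offsets of the four ZFS vdev labels for a device of a
# given size. Assumes 256 KiB labels and a device size rounded down to a
# 256 KiB multiple before the last two labels are positioned.
LABEL_SIZE=262144

label_offsets() {
    size=$1
    aligned=$(( size / LABEL_SIZE * LABEL_SIZE ))
    echo "L0 0"
    echo "L1 $LABEL_SIZE"
    echo "L2 $(( aligned - 2 * LABEL_SIZE ))"
    echo "L3 $(( aligned - LABEL_SIZE ))"
}

# Example with a made-up 1 GiB device:
label_offsets 1073741824
```

If a LUN shrinks, the offsets computed from the old size fall beyond the
new end of the device, which would explain labels 2 and 3 failing while
0 and 1 still read fine.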

Doug Freyburger

Jan 4, 2012, 12:51:57 PM
Andrew Gabriel wrote:
> Doug Freyburger <dfre...@yahoo.com> writes:
>
>> When I try any of the 4 PowerPath devices they all see the first 2
>> labels and they are correctly copies of each other with all the same
>> data. For the 3rd and 4th label they say:
>>
>> --------------------------------------------
>> LABEL 2
>> --------------------------------------------
>> failed to read label 2
>> --------------------------------------------
>> LABEL 3
>> --------------------------------------------
>> failed to read label 3
>>
>> Now I'm off to FTFM (Find) and RTFM (Read) to figure out what it means
>> to have 2 of 4 labels available. It sounds bad - Whatever overwrote
>> some of the labels likely also overwrote some of the content data.
>
> Labels 0 and 1 are at the start of the LUN, whereas labels 2 and 3
> are at the end of the LUN. Have the LUNs been truncated?

It being a test/dev host my records aren't as complete as I have for
production hosts. I can't find a record of the sizes of the 4 LUNs
originally assigned to this host. It's one that was rebuilt by someone
else a few years ago. Checking the sizes, one is only 18 GB. Sounds
much too small. Checking further.

cindy

Jan 4, 2012, 1:04:20 PM
On Jan 4, 10:51 am, Doug Freyburger <dfrey...@yahoo.com> wrote:
> Andrew Gabriel wrote:
You might be able to review the existing LUN sizes with the format
utility to see if they have changed.

Thanks,

Cindy

Andrew Gabriel

Jan 4, 2012, 1:23:08 PM
In article <d511c4d5-752b-476c...@d9g2000yqg.googlegroups.com>,
I think the asize field in the vdev_tree part of the label is the size
of that top level vdev in bytes. See how that compares with
the LUN sizes. (Not sure how this works for RAIDZ.)

Doug Freyburger

Jan 5, 2012, 5:22:53 PM
Andrew Gabriel wrote:
>
> I think the asize field in the vdev_tree part of the label is the size
> of that top level vdev in bytes. See how that compares with
> the LUN sizes. (Not sure how this works for RAIDZ.)

The guy with the local backup access is back from vacation. Welcome
back, no frying pan for you. Straight into the fire! Prod backups are
checked regularly. No luck finding *any* unexpired backups of this
host. He knows who made the change and about when but now's not the
time to deal with that. Looks like we'll need to rebuild test/dev from
prod if there's no recovery from the LUNs.

emcpower0c

Per zdb - asize=20395327488
Per format - 2 backup wu 0 - 38909 19.00GB
(38910/0/0) 39843840

emcpower1c

Per zdb - asize=71931789312
Per format - 2 backup wu 0 - 34301 67.00GB
(34302/0/0) 140500992

And so on. Thanks for something else to check. They agree. Thus
something wrote over the LUNs and killed the data. Argh.
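
A worked check of those numbers, as a sketch only. It assumes 512-byte
sectors, and it assumes ZFS reserves 4 MiB at the front of each device
(two labels plus the boot area) and 512 KiB at the end (two labels), so
asize should equal the slice size minus 4718592 bytes. The function name
is hypothetical.

```shell
#!/bin/bash
# Sketch: expected asize for a slice, given its sector count from format.
# Assumptions: 512-byte sectors; 4 MiB reserved at the front and 512 KiB
# at the end of the device (4718592 bytes in total).
expected_asize() {
    sectors=$1
    echo $(( sectors * 512 - 4718592 ))
}

expected_asize 39843840     # emcpower0c, compare with asize=20395327488
expected_asize 140500992    # emcpower1c, compare with asize=71931789312
```

Both results equal the zdb values, so the slice sizes in the disk label
still match what ZFS recorded. The disk label is not rewritten when a LUN
shrinks, so this match alone does not rule out truncation.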

I tried "zfs send". No luck.

Better data - All sorts of memory errors in the month before. I look
for those in my monthly health checks and would have seen them this
week. At this point I think the ZFS data loss is a symptom of the
disease not the disease itself. With many DIMMs mentioned as bad in the
/var/adm/messages in the last month it sounds like a bad CPU that took
out disk data as well as memory modules. Right now I'm on a conference
call with the client and EMC. EMC is walking him through SPCollect
right now just to double check that the storage is okay.

End game - Build an entirely new host on a VMware server. That opens
the option of Solaris x86 versus Solaris. For this application they'll
want Solaris x86.

Thanks so much for the help! I owe several favors now so I'll stay
subscribed to comp.unix.solaris in addition to comp.unix.admin.

Ian Collins

Jan 5, 2012, 5:40:13 PM
On 01/ 6/12 11:22 AM, Doug Freyburger wrote:
>
> End game - Build an entirely new host on a VMware server. That opens
> the option of Solaris x86 versus Solaris. For this application they'll
> want Solaris x86.

Solaris x86 *is* Solaris.

--
Ian Collins

Andrew Gabriel

Jan 5, 2012, 5:47:20 PM
In article <je57rt$5vs$1...@dont-email.me>,
Doug Freyburger <dfre...@yahoo.com> writes:
> Andrew Gabriel wrote:
>>
>> I think the asize field in the vdev_tree part of the label is the size
>> of that top level vdev in bytes. See how that compares with
>> the LUN sizes. (Not sure how this works for RAIDZ.)
>
> The guy with the local backup access is back from vacation. Welcome
> back, no frying pan for you. Straight into the fire! Prod backups are
> checked regularly. No luck finding *any* unexpired backups of this
> host. He knows who made the change and about when but now's not the
> time to deal with that. Looks like we'll need to rebuild test/dev from
> prod if there's no recovery from the LUNs.
>
> emcpower0c
>
> Per zdb - asize=20395327488
> Per format - 2 backup wu 0 - 38909 19.00GB
> (38910/0/0) 39843840
>
> emcpower1c
>
> Per zdb - asize=71931789312
> Per format - 2 backup wu 0 - 34301 67.00GB
> (34302/0/0) 140500992
>
> And so on. Thanks for something else to check. They agree. Thus
> something wrote over the LUNs and killed the data. Argh.

I would expect slice 2 to agree, even if the LUN shrunk (the VToC
won't have been rewritten).

Try using dd to read the last block of the LUN.
e.g.
# expr 20395327488 / 512
39834624
# dd if=/dev/dsk/emcpower0c of=/dev/null iseek=39834623 count=1
1+0 records in
1+0 records out
#

Make sure you get 1+0 records, and not 0+0 records (i.e. not off
the end of the device).
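
The same test can be wrapped in a small function, sketched here. It takes
the sector count (for example the 39843840 that format reports for
emcpower0c) and tries to read the block at index sectors - 1. It uses the
portable skip= operand instead of iseek=; the function name is
hypothetical.

```shell
#!/bin/bash
# Sketch: succeed only if one full 512-byte block can be read at block
# index (sectors - 1) of the given device or file.
last_block_readable() {
    dev=$1
    sectors=$2
    # dd reports record counts on stderr; "1+0 records in" means one
    # full block was read. LC_ALL=C keeps that message untranslated.
    LC_ALL=C dd if="$dev" of=/dev/null bs=512 \
        skip=$(( sectors - 1 )) count=1 2>&1 |
        grep -q '^1+0 records in'
}

# Intended use on the real host (not run here):
#   last_block_readable /dev/dsk/emcpower0c 39843840 && echo "end readable"
```

A failure here while earlier blocks read fine would suggest the device is
now shorter than its label claims.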

Doug Freyburger

Jan 6, 2012, 2:02:56 AM
Ian Collins wrote:
> Doug Freyburger wrote:
>
>> End game - Build an entirely new host on a VMware server. That opens
>> the option of Solaris x86 versus Solaris. For this application they'll
>> want Solaris x86.
>
> Solaris x86 *is* Solaris.

Thanks for noticing that I spelled Linux wrong above.

cindy

Jan 6, 2012, 11:29:01 AM
I think you are saying that you will host your data on VMware under
Solaris, which will provide more copies of your data as VM images.
You still need to consider that (unknowingly) changing LUNs with
live data will hurt.

You might consider getting a set of LUNs and mirroring them with
ZFS. In addition to frequent backups, attach another set of LUNs,
let them resilver and then split them off the pool. You will have a
replicated pool of data on those LUNs. Repeat this weekly and
watch more frequently for system/memory health issues.

Thanks,

Cindy

Cydrome Leader

Jan 6, 2012, 5:08:51 PM
In comp.unix.solaris Doug Freyburger <dfre...@yahoo.com> wrote:
try veritas for your next filesystem. it doesn't explode like zfs does.

It sounds weird that the world's best most awesome checksumming filesystem
could be corrupted by bad memory and not warn you at all.


cindy

Jan 6, 2012, 6:39:26 PM
On Jan 6, 3:08 pm, Cydrome Leader <prese...@MUNGEpanix.com> wrote:
The zpool status command warns you about corrupted data, as do fmdump
and fmadm faulty, and you can use smtp-notify in Oracle Solaris 11 to
send you an email notification of a fault or error.

Doug had both a LUN (or LUNs?) silently truncated and memory corruption,
so unfortunately a 30-day health check wasn't enough. If his pool
devices had still been available, he would have seen the corruption in
zpool status.

For a non-redundant config with underlying LUNs that might be managed
by someone else, I would want to check more often, like weekly.
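
A minimal sketch of what such a weekly check could look like. It assumes
that `zpool status -x` prints exactly "all pools are healthy" when nothing
is wrong and treats any other output as a problem. The function name, the
cron wiring, and the mailx recipient are illustrative assumptions.

```shell
#!/bin/sh
# Sketch: decide from `zpool status -x` output (on stdin) whether
# anything needs attention. Any text other than the healthy message,
# including "no pools available", counts as a problem.
pools_need_attention() {
    status=$(cat)
    [ "$status" != "all pools are healthy" ]
}

# Intended use from a weekly cron job (not run here):
#   zpool status -x | pools_need_attention &&
#       zpool status -x | mailx -s "zpool alert" root
```

Treating "no pools available" as a problem is deliberate: a pool that has
vanished entirely should raise an alert too.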

cs

Cydrome Leader

Jan 6, 2012, 6:42:39 PM
In comp.unix.solaris cindy <cindy.sw...@oracle.com> wrote:
> On Jan 6, 3:08 pm, Cydrome Leader <prese...@MUNGEpanix.com> wrote:
>> In comp.unix.solaris Doug Freyburger <dfrey...@yahoo.com> wrote:
>>
>>
>>
>> > Andrew Gabriel wrote:
>>
>> >> I think the asize field in the vdev_tree part of the label is the size
>> >> of that top level vdev in bytes. See how that compares with
>> >> the LUN sizes. (Not sure how this works for RAIDZ.)
>>
>> > The guy with the local backup access is back from vacation. Welcome
>> > back, no frying pan for you. Straight into the fire! Prod backups are
>> > checked regularly. No luck finding *any* unexpired backups of this
>> > host. He knows who made the change and about when but now's not the
>> > time to deal with that. Looks like we'll need to rebuild test/dev from
>> > prod if there's no recovery from the LUNs.
>>
>> > emcpower0c
>>
>> > Per zdb - asize=20395327488
>> > Per format - 2 backup wu 0 - 38909 19.00GB
>> > (38910/0/0) 39843840
>>
>> > emcpower1c
>>
>> > Per zdb - asize=71931789312
>> > Per format - 2 backup wu 0 - 34301 67.00GB
>> > (34302/0/0) 140500992
>>
>> > And so on. Thanks for something else to check. They agree. Thus
>> > something wrote over the LUNs and killed the data. Argh.
>>
>> > I tried "zfs send". No luck.
>>
>> > Better data - All sorts of memory errors in the month before. I look
>> > for those in my monthly health checks and would have seen them this
>> > week. At this point I think the ZFS data loss is a symptom of the
>> > disease not the disease itself. With many DIMMs mentioned as bad in the
>> > /var/adm/messages in the last month it sounds like a bad CPU that took
>> > out disk data as well as memory modules. Right now I'm on a conference
>> > call with the client and EMC. EMC is walking him through SPCollect
>> > right now just to double check that the storage is okay.
>>
>> > End game - Build an entirely new host on a VMware server. That opens
>> > the option of Solaris x86 versus Solaris. For this application they'll
>> > want Solaris x86.
>>
>> > Thanks so much for the help! I owe several favors now so I'll stay
>> > subscribed to comp.unix.solaris in addition to comp.unix.admin.
>>
>> try veritas for your next filesystem. it doesn't explode like zfs does.
>>
>> It sounds weird that the world's best most awesome checksumming filesystem
>> could be corrupted by bad memory and not warn you at all.
>
> The zpool status command warns you about corrupted data, fmdump,
> fmadm
> faulty, and you can use smtp-notify in OS 11 to send you an email
> notification
> of a fault or error.
>
> Doug had both LUN or LUNs? silently truncated and memory corruption
> so unfortunately, a 30 day health check wasn't enough. If his pool
> devices
> had still been available, he would have seen the corruption in zpool
> status.

They should really enable panic on ZFS errors by default. It doesn't
really make sense that a machine with a blown-up filesystem would run
until you try to read or write to it.

ZFS errors can be there for days before you notice them, even though
Solaris is aware of them. Sort of scary.


Michael

Jan 13, 2012, 10:02:52 AM
Hi,

On 01/07/12 12:39 AM, cindy wrote:
> On Jan 6, 3:08 pm, Cydrome Leader<prese...@MUNGEpanix.com> wrote:
>> In comp.unix.solaris Doug Freyburger<dfrey...@yahoo.com> wrote:
>>
>>
><snip>

>>> Thanks so much for the help! I owe several favors now so I'll stay
>>> subscribed to comp.unix.solaris in addition to comp.unix.admin.
>>
>> try veritas for your next filesystem. it doesn't explode like zfs does.
>>
>> It sounds weird that the world's best most awesome checksumming filesystem
>> could be corrupted by bad memory and not warn you at all.
>
> The zpool status command warns you about corrupted data, fmdump,
> fmadm
> faulty, and you can use smtp-notify in OS 11 to send you an email
> notification
> of a fault or error.
>
> Doug had both LUN or LUNs? silently truncated and memory corruption
> so unfortunately, a 30 day health check wasn't enough. If his pool
> devices
> had still been available, he would have seen the corruption in zpool
> status.
>
> For a non-redundant config with underlying LUNs that might be managed
> by
> someone else, I would want to check more often, like weekly.
>
> cs
>
I had maybe a similar issue, at least according to Oracle's support: a
memory corruption that made the pool impossible to import. BUT I
was lucky that I could import it read-only!

In my case I had hoped that, since it was possible to import it
read-only, it would also be possible to have a ZFS tool make the
fs well again!

When importing the pool read-write, not even a reboot worked;
I had to use mdb to force a reboot!

My fs is still in that state on the failing server in the hope of a zdb
script that can recover the pool, but the work to make such a script is
sadly not high priority at Oracle.

Cindy, can you do something to escalate it perhaps? SR 3-4393435551. I
think every effort to recover a failing pool should be made!

/michael

cindy swearingen

Jan 14, 2012, 12:05:03 PM
Hi Michael,

If this pool suffered the results of memory corruption and devices
UNAVAIL, then I think that a read-only import is the best we can do
right now.

Are you able to copy off the data from this pool?

I looked at your SR and Victor has performed miracles in the past, but
in this case you would have to reconstruct the data, which can be a
painstaking process.

I will talk to him about what we can do about this in the future.

In the meantime, if you are building large pools (TBs of data),
consider that RAIDZ1 or RAIDZ2 is probably not the best choice,
possibly for performance (mirrored pools are best for small random
I/O), but mostly because the rebuild/resilver times will be very long.
Also, a reminder that a RAIDZ1 pool or vdev can only withstand the loss
of one disk, and a RAIDZ2 pool or vdev only two. With a very large
pool, RAIDZ3 might be a better choice, and mirrored pools are probably
best of all.

Thanks,

Cindy

Michael

Jan 15, 2012, 1:33:43 PM
Hi,

On 01/14/12 06:05 PM, cindy swearingen wrote:
> On Jan 13, 8:02 am, Michael<michael_laaja...@yahoo.com> wrote:
>> Hi,
>>
>> On 01/07/12 12:39 AM, cindy wrote:
<snip>

>>
>> I had maybe a similar issue atleast according to Oracles support, a
>> memory corruption that caused the pool to be impossible to import BUT I
>> was lucky that I could import it readonly!
>>
>> For my case I had hoped that since it was possible to import it in
>> read-only that it would be also possible to have a zfs tool to make the
>> fs well again!
>>
>> Or atleast when importing the pool read-write not even reboot did work,
>> I had to use mdb to force a reboot!
>>
>> My fs is still in that state on the failing server in the hope for a zdb
>> script that can recover the pool but the work to make such as script in
>> not high priority at Oracle sadly.
>>
>> Can you Cindy do something to escalate it perhaps? SR 3-4393435551 since
>> I think all effort so recover a failing pool should be made!
>>
>> /michael
>
> Hi Michael,
>
> If this pool suffered the results of memory corruption and devices
> UNAVAIL,
> then I think that the a read-only import is the best we can do right
> now.
>
Yes, that is what I have been told as well, but if it is possible to
read it, it sounds like it should be possible to reconstruct the
filesystem without needing to destroy the pool first, at least in my eyes!

> Are you able to copy off the data from this pool?
>
Yes, I have copied most of it to the old trustworthy E450 :) but space
on it is limited (only 300GB drives), so I have spread it over some more
servers for now.


> I looked at your SR and Victor has performed miracles in the past but
> in
> this case, you would have to reconstruct the data which can be a
> painstaking
> process.
>
So he told me. He was planning to make a zdb script, but since he managed
to help me import the pool it became less critical, and I guess he also
has a lot to do, so no script yet.
Actually I still keep this read-only filesystem up in case he would like
to test/debug the script on a real case.

If he doesn't need it then I will destroy the pool when I have the time
to move the remaining filesystems.

> I will talk to him about what we can do about this in the future.
>
Please do that. We (your customers) need all the safety lines we can get
to avoid recreating large filesystems.

> In the meantime, if you are building large pools (TBs of data),
> consider that
> RAIDZ1 or RAIDZ2 is probably not the best choice, possibly for
> performance
> (mirrored pools are best for small random I/O), but mostly that the
> rebuild/resilver
> times will be very long. Also a reminder is that RAIDZ1 or a RAIDZ1
> vdev
> can only withstand the loss of one disk, RAIDZ2 or a RAIDZ2 vdev only
> 2
> devices. With a very large pool, RAIDZ3 might be a better choice and
> mirrored pools,
> probably best of all.
>
> Thanks,
>
> Cindy


This filesystem is a RAIDZ2 with 6+2 drives and 2 spares, as well as
4 SSDs for ZIL and cache.

A maybe dumb question: what mirrored pool would you recommend if using
10 drives on two HBAs?

cheers


Michael

Michael

Jan 15, 2012, 1:37:02 PM
Hi,

<snip>

>
> I looked at your SR and Victor has performed miracles in the past but
> in
> this case, you would have to reconstruct the data which can be a
> painstaking
> process.
>
I must add that I did get great and very nice help from Victor, I think
he knows what he is doing... ;)

Much thanks to him!

/michael

cindy

Jan 17, 2012, 12:18:17 PM
Hi Michael,

Not dumb at all. If I had 10 drives across two HBAs, I would create 4
mirrored pairs of disks across the HBAs with 2 spares.

Victor is awesome and has saved many pools when possible. I will talk
to him about the zdb script.

Thanks,

Cindy

Michael

Jun 4, 2012, 4:27:42 PM
Hi Cindy,
On 01/17/12 06:18 PM, cindy wrote:
<snip>
It's been a while, but vacation is coming, so it's time to plan updating
everything.

I have to ask you some more, maybe even dumber than the previous question :)

You suggest this: two HBAs, each one with 4 drives running and 1 spare,
and all in one pool, right? See below; v1, v2, v3 and v4 are vdevs!

       v1  v2  v3  v4  spare
HBA-A  D1  D2  D3  D4  S1
HBA-B  D1  D2  D3  D4  S1

Compared to raidz3 with 8 drives and 2 spares.

What is the reason for not going raidz3? To my mind raidz3 can survive
three drive failures, but with the mirrored layout, if both drives in
the same vdev fail then it all fails, or not?

Or does a raidz2/3 resilver stress the drives more than a mirrored rebuild?

A failing HBA is nothing I worry about (I think); that can be replaced
within 30 minutes!

cheers

Michael

cindy

Jun 4, 2012, 5:47:57 PM
Hi Michael,

I'm not quite following your mirrored VDEV arrangement, but I would do
something like this with 4 disks on each of 2 HBAs, say c1 and c2:

# zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 \
    mirror c1t2d0 c2t2d0 spare c1t3d0 c2t3d0

If two disks failed in the same VDEV but the spare kicked in and
resilvered before the second one failed, then you would be okay.
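
For layouts with more disks, the pairing pattern can be generated rather
than typed. This is only a sketch: it prints the command instead of
running it, and the pool name, controller names, and cXtYd0 device naming
are assumptions taken from the example in this post. The function name is
hypothetical.

```shell
#!/bin/bash
# Sketch: print (not run) a zpool create command that pairs target N on
# one controller with target N on the other, and uses the last target on
# each controller as a spare.
mirror_layout() {
    pool=$1; ctl_a=$2; ctl_b=$3; per_hba=$4
    cmd="zpool create $pool"
    t=0
    while [ "$t" -lt $(( per_hba - 1 )) ]; do
        cmd="$cmd mirror ${ctl_a}t${t}d0 ${ctl_b}t${t}d0"
        t=$(( t + 1 ))
    done
    echo "$cmd spare ${ctl_a}t${t}d0 ${ctl_b}t${t}d0"
}

mirror_layout tank c1 c2 4
```

With 5 disks per HBA instead of 4, the same call produces four mirrored
pairs plus two spares.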

Other things to consider between mirrored and RAIDZ configs:

1. Mirrored pools perform well for most workloads, particularly small
   reads and writes
2. RAIDZ pools perform well for large I/O workloads but generally not
   for small reads/writes
3. RAIDZ pools (in your example with 6 disks in one RAIDZ3 VDEV) will
   take longer to resilver
4. Mirrored pools are more flexible. You can attach and detach devices
   and can add VDEVs.
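
To put rough numbers on the comparison, a small sketch. It counts only
data disks and the number of failures a layout is guaranteed to survive,
and ignores spares, metadata, and RAIDZ padding. The function name and
output format are made up for illustration.

```shell
#!/bin/bash
# Sketch: data-disk count and guaranteed failure tolerance for a layout.
# Usage: layout_summary mirror <pairs>   or   layout_summary raidz<P> <disks>
layout_summary() {
    kind=$1; n=$2
    case $kind in
        mirror)
            # Each 2-way mirror holds one disk of data; only a single
            # failure is guaranteed survivable (one per pair at best).
            echo "data_disks=$n tolerates_any=1" ;;
        raidz[123])
            parity=${kind#raidz}
            echo "data_disks=$(( n - parity )) tolerates_any=$parity" ;;
        *)
            echo "unknown layout: $kind" >&2
            return 1 ;;
    esac
}

layout_summary mirror 4    # four 2-way mirrors (8 disks)
layout_summary raidz3 8    # one 8-disk RAIDZ3 vdev
```

The RAIDZ3 layout wins on guaranteed tolerance and capacity; the points
in the list above are about performance, resilver time, and flexibility.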

Either way, always have good backups.

Thanks,

Cindy

Michael

Jun 5, 2012, 2:34:46 AM
Hi,
On 06/04/12 11:47 PM, cindy wrote:
<snip>
>
> Hi Michael,
>
> I'm not quite following your mirrored VDEV arrangement, but I would do
> something
> like this with 4 disks across 2 HBAs, say c1 and c2:
>
> # zpool create tank mirror c1t0d0 c2t0d0 mirror c1t1d0 c2t1d0 mirror
> c1t2d0 c2t2d0
> spare c1t3d0 c2t3d0
>
> If two disks failed in the same VDEV but the spare kicked and
> resilvered before the
> second one failed, then you would be okay.
>
First, thanks for helping me out :)

If both disks fail in the same VDEV and the spare did not kick in and
resilver in time, is the whole pool gone then?

> Other things to consider between mirrored and RAIDZ configs:
>
> 1. Mirrored pools perform well for most workloads, particularly small
> reads and writes
> 2. RAIDZ pools perform well for large I/O workloads but generally not
> for small reads/writes
> 3. RAIDZ pools (in your example with 6 disks in one raid3 VDEV) will
> take longer
> to resilver
> 4. Mirrored pools are more flexible. Yan attach and detach devices and
> can add
> VDEVs.
>
> Either way, always have good backups.
>
Yes, but it's getting harder and harder. I think we have some 20+ tapes
for a full backup, a real pain since it's not an autoloader (:

Oh yes, one more question: during a resilver in a raidz2 all drives are
stressed pretty hard, but in a mirrored layout like you are suggesting
only two drives are stressed (the remaining one and the spare), right?

cheers

Michael


cindy

Jun 6, 2012, 10:57:44 AM
Yes, if both disks in the same VDEV failed, which is unlikely, I think,
then the pool would not have a complete copy of your data. It's
important to monitor and maintain your pools, so keep watching with
zpool status and fmdump, and consider using smtp-notify. One problem I
see is that admins are unaware that their disks are failing.

Yes, only the remaining disk and spare are impacted by resilvering.

Thanks, Cindy