Checksum errors on a drive currently being replaced

Tan Chee Eng

Oct 11, 2013, 5:51:53 AM

to zfs-...@googlegroups.com, Joshua Ng

Hi,

I've tried my best to search for an answer to this, but there was too much noise: in pretty much all the discussions on checksum errors and replacing drives, people are asking about replacing a drive after seeing checksum errors. I'm hoping somebody here can help me understand this.

One of my pools experienced two hard drive failures (fortunately in two different vdevs). I've replaced the failed drives, but I've been seeing CKSUM errors popping up on the devices that are being replaced. Here's a dump of "zpool status" on my machine:

  pool: archivepool
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h1m, 0.02% done, 107h55m to go
config:

        NAME STATE READ WRITE CKSUM
        archivepool DEGRADED 0 0 0
          raidz1-0 DEGRADED 0 0 0
            replacing-0 DEGRADED 0 0 277
              6312875129823442296 UNAVAIL 0 0 0 was /dev/disk/by-id/usb-WD_Ext_HDD_1021_574D415A4132393938343534-0:0
              disk/by-id/usb-Seagate_Backup+_Desk_NA5KKM31-0:0 ONLINE 0 0 0
            disk/by-id/usb-WD_Ext_HDD_1021_5743415A4132353337343532-0:0 ONLINE 0 0 0
            disk/by-id/usb-WD_Ext_HDD_1021_5743415A4133343235393936-0:0 ONLINE 0 0 0
            disk/by-id/usb-WD_Ext_HDD_1021_5743415A4133323837353135-0:0 ONLINE 0 0 0
          raidz1-1 DEGRADED 0 0 0
            disk/by-id/usb-BUFFALO_External_HDD_01092000209C-0:0 ONLINE 0 0 0
            disk/by-id/usb-BUFFALO_HD-CXU2_0010100702091C720-0:0 ONLINE 0 0 0
            disk/by-id/usb-BUFFALO_HD-CXU2_00101007020C94C80-0:0 ONLINE 0 0 0
            replacing-3 DEGRADED 0 0 132
              4051679456373771497 UNAVAIL 0 0 0 was /dev/disk/by-id/usb-BUFFALO_HD-CXU2_0010100702071DD80-0:0
              disk/by-id/usb-Seagate_Backup+_Desk_NA5KKM3P-0:0 ONLINE 0 0 0

errors: No known data errors

As you can see, the pool is resilvering, but I'm seeing checksum errors on replacing-0 and replacing-3. I understand what checksum errors on the pool, a vdev, or a device mean, but what do they mean on a drive that isn't even a part of the vdev yet? Is my new drive bad, or did I just lose data?

Regards,

Chee Eng

Ethan

Oct 24, 2013, 4:14:32 PM

to zfs-...@googlegroups.com

I don't know if you found your answer yet (I don't look at this list very often), but from what I understand, you should be fine. I'm no expert, just a user who has encountered occasional device failure and drive corruption, so take this with a grain of salt.

The fact that you have "errors: No known data errors" means you haven't lost data. If you had lost data, you'd see a cksum error on the vdev and/or pool, that message would say "errors: 149 data errors, use '-v' for a list", and the status would say "status: One or more devices has experienced an error resulting in data corruption. Applications may be affected." (I don't know if it would say this during the resilvering or after, though.)

I don't know why you have checksum errors on the replacing devices, and I can't recall whether I have seen that myself when replacing devices, but I'd wait for the resilver to finish (which I suppose it has by now), do a scrub, and see if everything looks clear. Keep an eye on it in the future and scrub from time to time; if the drive shows further checksum errors, I'd replace it.
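For anyone who wants to automate the check described above, here is a rough sketch (not something taken from the posts) that reads "zpool status" text and reports whether the "errors:" line indicates known data errors, plus which rows show a non-zero CKSUM count. The function name and the SAMPLE text are made up for illustration; the sample is trimmed from the output earlier in this thread. It assumes the plain NAME STATE READ WRITE CKSUM row layout and skips abbreviated counts such as "1.2K".

```python
def summarize_zpool_status(text):
    """Return (data_errors_known, cksum_rows) for `zpool status` output.

    Hypothetical helper: it only understands rows of the form
    NAME STATE READ WRITE CKSUM [notes...] with plain integer counters.
    """
    data_errors_known = False
    cksum_rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("errors:"):
            # zpool prints "No known data errors" when no file damage is known.
            data_errors_known = "No known data errors" not in stripped
            continue
        fields = stripped.split()
        # Counter columns are fields 2..4; ignore rows that are not device rows.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            cksum = int(fields[4])
            if cksum > 0:
                cksum_rows.append((fields[0], cksum))
    return data_errors_known, cksum_rows


# Trimmed from the status output posted at the top of the thread.
SAMPLE = """\
config:
NAME STATE READ WRITE CKSUM
archivepool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
replacing-0 DEGRADED 0 0 277
raidz1-1 DEGRADED 0 0 0
replacing-3 DEGRADED 0 0 132
errors: No known data errors
"""

if __name__ == "__main__":
    known, rows = summarize_zpool_status(SAMPLE)
    print("data errors known:", known)
    for name, count in rows:
        print(name, "CKSUM =", count)
```

On the sample, only the two replacing rows are flagged, and the "errors:" line reports no known data errors, which matches the reading given above.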


Tan Chee Eng

Oct 24, 2013, 9:38:28 PM

to zfs-...@googlegroups.com

Hi Ethan,

Thanks for replying. As an update to my situation: the resilver was extremely slow and finally crashed zfs-fuse a few days later. I switched to ZFS on Linux, at which point I realised something: the pool was created with 512-byte-sector drives (ashift=9), but the drives I bought to replace them are Advanced Format (4K-sector) drives. It looks like I'll now have to back up my data somewhere and recreate the pool with ashift=12.

I'll need to find some place to borrow 8TB+ of hard drives, though... that might be a bit difficult.
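For reference, the ashift values mentioned above are just the base-2 logarithm of the sector size the pool aligns its writes to: 2^9 = 512 bytes and 2^12 = 4096 bytes. A minimal sketch of that arithmetic follows; the function names are hypothetical and this is only an illustration, not a ZFS API.

```python
def ashift_for_sector_size(sector_bytes):
    """Return the ashift (log2 of the sector size) for a power-of-two size."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a positive power of two")
    # For a power of two, bit_length() - 1 is exactly log2.
    return sector_bytes.bit_length() - 1


def sector_size_for_ashift(ashift):
    """Return the sector size in bytes that an ashift value corresponds to."""
    return 1 << ashift


if __name__ == "__main__":
    for size in (512, 4096):
        print(size, "byte sectors -> ashift", ashift_for_sector_size(size))
```

With ashift=9 the pool may issue writes aligned only to 512 bytes, which a 4K-sector drive has to handle by rewriting the whole 4096-byte physical sector, so a very slow resilver would be consistent with that mismatch.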