I recently had some issues with a PSU and ran several scrubs on a pool
holding around 35T. Random drives would drop and require a zpool online,
which turned up checksum errors (as expected). However, after all the
scrubs I ran, I think I may have found a bug in the resilvering process
that follows a zpool online.

Pool layout: 24 disks total, in 4 raidz2 vdevs of 6 drives each.

Some background before the next part: I had a backup PSU, but it was also
going bad and was waiting on an RMA. The current one seemed to be dying
but ran fine with fewer drives, so I decided to run the server four
drives short.

I started by offlining 4 drives from different vdevs (or leaving them
disconnected from the PSU), then ran a scrub to verify everything. Many
checksum errors were present on some of the drives, but this was expected
given the faulty PSU. I then offlined 4 different drives, onlined the
previous 4, and scrubbed once again. After the resilver there were again
many checksum errors on these drives, as expected.

After the scrub completed, I offlined 4 different drives and then onlined
the ones that had been out of the pool for a while. During the resilver,
checksum errors were once again found. This surprised me given the recent
scrub, so I ran another scrub, and it found even more checksum errors on
these recently onlined drives. I didn't think much of it at the time.
However, after the replacement PSU arrived, I onlined all the drives that
were out of the pool, and once again the resilver reported checksum
errors, and a follow-up scrub found still more.

Is this a known issue? Is it common for a scrub to be required after
onlining a disk that has been out of the pool for some time?

The drives are ST4000NM0033s, and until recently they had never had a
single checksum error in their lifetime (at least under ZFS).
FreeBSD S1 12.0-CURRENT FreeBSD 12.0-CURRENT #19 r303224: Sat Jul 23 10:41:12 EDT 2016
Sorry for the wall of text, but I hope this helps in tracking down this