
Continuously crashing ZFS server


Willem Jan Withagen

Jun 8, 2018, 6:06:22 AM
Hi,

My file server is currently crashing about every 15 minutes.
The panic looks like this:

Jun 8 11:48:43 zfs kernel: panic: Solaris(panic): zfs: allocating allocated segment(offset=12922221670400 size=24576)
Jun 8 11:48:43 zfs kernel:
Jun 8 11:48:43 zfs kernel: cpuid = 1
Jun 8 11:48:43 zfs kernel: KDB: stack backtrace:
Jun 8 11:48:43 zfs kernel: #0 0xffffffff80aada57 at kdb_backtrace+0x67
Jun 8 11:48:43 zfs kernel: #1 0xffffffff80a6bb36 at vpanic+0x186
Jun 8 11:48:43 zfs kernel: #2 0xffffffff80a6b9a3 at panic+0x43
Jun 8 11:48:43 zfs kernel: #3 0xffffffff82488192 at vcmn_err+0xc2
Jun 8 11:48:43 zfs kernel: #4 0xffffffff821f73ba at zfs_panic_recover+0x5a
Jun 8 11:48:43 zfs kernel: #5 0xffffffff821dff8f at range_tree_add+0x20f
Jun 8 11:48:43 zfs kernel: #6 0xffffffff821deb06 at metaslab_free_dva+0x276
Jun 8 11:48:43 zfs kernel: #7 0xffffffff821debc1 at metaslab_free+0x91
Jun 8 11:48:43 zfs kernel: #8 0xffffffff8222296a at zio_dva_free+0x1a
Jun 8 11:48:43 zfs kernel: #9 0xffffffff8221f6cc at zio_execute+0xac
Jun 8 11:48:43 zfs kernel: #10 0xffffffff80abe827 at taskqueue_run_locked+0x127
Jun 8 11:48:43 zfs kernel: #11 0xffffffff80abf9c8 at taskqueue_thread_loop+0xc8
Jun 8 11:48:43 zfs kernel: #12 0xffffffff80a2f7d5 at fork_exit+0x85
Jun 8 11:48:43 zfs kernel: #13 0xffffffff80ec4abe at fork_trampoline+0xe
Jun 8 11:48:43 zfs kernel: Uptime: 9m7s

Maybe a known bug?
Is there anything I can do about this?
Any debugging needed?

The system is running FreeBSD 11.1-RELEASE-p10.

Thanx,
--WjW

Andriy Gapon

Jun 11, 2018, 6:26:02 AM
Sorry to inform you, but your on-disk data got corrupted.
The most straightforward thing you can do is try to save the data from the
pool in read-only mode.
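
For example, with a hypothetical pool named tank and an existing snapshot
tank@last (substitute your own names):

    # re-import the pool read-only
    zpool export tank
    zpool import -o readonly=on tank

    # then copy the data off, e.g. by sending an existing snapshot
    # to a healthy pool (new snapshots cannot be created while the
    # pool is imported read-only)
    zfs send -R tank@last | zfs receive -F backup/tank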

--
Andriy Gapon

Willem Jan Withagen

Jun 11, 2018, 6:31:10 AM
Hi Andriy,

Ouch, that is a first in 12 years of using ZFS. "Fortunately" it was a
test ZVOL->iSCSI->Win10 disk on which I spool my CAMs.

Removing the ZVOL actually fixed the rebooting, but now the question is:
are the remaining zpools on the same disks in danger?

--WjW

Andriy Gapon

Jun 11, 2018, 7:06:01 AM

You can try to check with zdb -b on an idle (better yet, exported) pool,
and also run a zpool scrub.
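
Something along these lines, with tank standing in for your pool name:

    # leak / double-allocation check; -e lets zdb open an exported pool
    zpool export tank
    zdb -eb tank

    # and a scrub once the pool is imported again
    zpool import tank
    zpool scrub tank
    zpool status -v tank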


--
Andriy Gapon

Willem Jan Withagen

Jun 11, 2018, 7:33:16 AM

If scrub says things are okay, can I start breathing again?
Exporting the pool is something for the small hours.

Thanx,
--WjW

Stefan Wendler

Jun 11, 2018, 8:39:48 AM
Do you use L2ARC/ZIL disks? I had a similar problem that turned out to
be a broken caching SSD. Scrubbing didn't help a bit because it reported
that the data was okay. And SMART was fine as well. Fortunately, I could
still send/recv snapshots to a backup disk, but wasn't able to replace
the SSDs without a pool restore. ZFS just wouldn't sync some older ZIL
data to disk and also wouldn't release the SSDs from the pool. Did you
also check the logs for entries that look like broken RAM?
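
In my case the rescue looked roughly like this (pool names made up):

    # check whether the pool has log/cache vdevs at all
    zpool status -v tank

    # evacuate the data to a spare disk
    zfs snapshot -r tank@evacuate
    zfs send -R tank@evacuate | zfs receive -F backup/tank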

Cheers,
Stefan

--
Stefan Wendler
stefan....@tngtech.com
+49 (0) 176 - 2438 3835
Senior Consultant

TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
Managing Directors: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Registered office: Unterföhring * Amtsgericht München * HRB 135082

Willem Jan Withagen

Jun 11, 2018, 8:53:05 AM
On 11-6-2018 14:35, Stefan Wendler wrote:
> Do you use L2ARC/ZIL disks? I had a similar problem that turned out to
> be a broken caching SSD. Scrubbing didn't help a bit because it reported
> that the data was okay. And SMART was fine as well. Fortunately, I could
> still send/recv snapshots to a backup disk, but wasn't able to replace
> the SSDs without a pool restore. ZFS just wouldn't sync some older ZIL
> data to disk and also wouldn't release the SSDs from the pool. Did you
> also check the logs for entries that look like broken RAM?

That was one of the things I looked for: bad things in the log files.
But the server does not seem to have any hardware problems.

I'll dive a bit deeper into my ZIL SSDs.
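
Probably starting with something like this (device names are just
examples; smartctl comes from sysutils/smartmontools):

    # identify the log devices in the pool
    zpool status -v

    # and look at the SSDs' own error counters
    smartctl -a /dev/ada4
    smartctl -a /dev/ada5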

Thanx,
--WjW

Stefan Wendler

Jun 11, 2018, 9:03:50 AM
Under normal circumstances you can just add/remove the caches from the
pool while the system is running. If something is fishy here, then ZFS
should inform you that there is still "dirty" data that has to be synced
when you try to remove the cache. I don't know the exact message, but it
is pretty clear.
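
For reference, under normal circumstances adding and removing them looks
like this (pool and device names are examples):

    # cache (L2ARC) device
    zpool add tank cache ada4
    zpool remove tank ada4

    # log (ZIL) device
    zpool add tank log ada5
    zpool remove tank ada5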