zfs-fuse: sending ioctl 2285 to a partition

Gordan Bobic

May 28, 2015, 12:56:42 PM

to zfs-...@googlegroups.com

I've been looking at the above issue. I have seen references to it having been reported a long time ago to:

http://zfs-fuse.net/issues/143

which is long gone.

As far as I can tell, the offending code is in function flushSCSIwc in 
lib/libzpool/flushwc.c

Please forgive me if this is a stupid question, but is there a reason why instead of 
ioctl(fd, SG_IO, &io_hdr)

it would not be appropriate to use something like

fsync(fd)

Would the latter not work appropriately on a raw block device?
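To make the question concrete, here is a minimal sketch of the fsync() variant. The helper name flush_fd is hypothetical and is not part of zfs-fuse. It only shows the call and its error handling; whether fsync() on a block device node also reaches the disk's own write cache depends on the kernel, which is exactly the open question.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper (not from zfs-fuse): flush an open descriptor
 * with fsync() and report failure as an errno-style code, matching
 * the ENOTSUP-style returns used in flushwc.c. */
int flush_fd(int fd)
{
    if (fsync(fd) == -1)
        return errno;   /* e.g. EBADF for an invalid descriptor */
    return 0;
}
```

The same descriptor that is currently passed to ioctl(fd, SG_IO, &io_hdr) would be passed here, so no SCSI-specific header has to be built.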

Gordan Bobic

May 29, 2015, 5:18:47 AM

to zfs-...@googlegroups.com

Looking at flushwc.c a little further, it looks suspiciously like only the SCSI and ATA disk cases are considered. I have it running on an SD card and it seems to work, but according to the code in that file, if the device is not an IDE or SCSI disk, it falls through to:

default:
//Unknown block device driver. Can't flush the write cache.
return ENOTSUP;

Whatever it does in this case clearly isn't fatal. But looking at how this works, I think this entire file could be simplified to a single function:
***
int flushwc(vnode_t *vn) {
    if (!S_ISBLK(vn->v_stat.st_mode))
        // We can only flush the write cache of a block device.
        return ENOTSUP;

    // BLKFLSBUF is a generic block device ioctl; report failure as an errno value.
    if (ioctl(vn->v_fd, BLKFLSBUF, 0) == -1)
        return errno;

    return 0;
}
***

That should work universally for flushing caches on any block device (it's a higher level ioctl call than for a raw disk), which means that SCSI and ATA specific functions could be removed.

Would anyone care to voice an opinion, or at least a willingness to test on some scratch data under various harsh conditions (e.g. pulling power while under heavy write load) vs. the behaviour with the original implementation?

Many thanks.

Gordan

Gordan Bobic

May 29, 2015, 12:23:41 PM

to zfs-...@googlegroups.com

So I've been reading up on the background of this issue. The core of the problem is that the disk flush SCSI or ATA command is being sent through to a partition. When raw disks are used with zfs-fuse, this doesn't happen because the command then gets sent to the fd that is the raw block device rather than a partition.

What is not clear (to me at least) is what happens when the SCSI flush cache command gets sent to the partition, i.e. whether the kernel issues it to the disk, or whether it is discarded with a warning. If the latter is what happens, then the current implementation has broken flushing anyway when partitions are used. It is worth pointing out at this stage that at least Solaris and ZoL default to setting up partitions and using those, even when the raw disk device nodes are specified at pool creation time.

So, options:

1) If the disk flush command is ignored, then something like the attached patch would make the situation a little less bad.

What it does is check whether the minor device node number implies the device is a partition, and if it is a partition it does:

fsync(fd);

ioctl(fd, BLKFLSBUF, 0);

It also does this if the block device is neither SCSI nor ATA, since those calls should be generically supported on all block devices.

Neither of the above calls issues a flush of hardware disk caches, but it should be better than nothing, and it ought to do away with the warning that results from sending a SCSI command to a partition.

2) Infer the disk node itself, preferably at pool import time, and store it somehow/somewhere (maybe in the vnode in memory, assuming that doesn't upset reflections between in-memory and on-disk data formats), and invoke the cache write-out ioctl on that. Workable but a little dirty and diverging further from other implementations.

3) Try to mimic more closely what other implementations do. Unfortunately, different forks seem to disagree on what to do.

FreeBSD doesn't appear to issue a cache flush here at all:
https://github.com/freebsd/freebsd/blob/master/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c

ZoL does, but it does so via some SPL (Solaris Porting Layer) magic by dispatching a vdev_file_io_fsync task to the vdev_file_taskq:
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_file.c

Adopting the required SPL functionality into zfs-fuse may be an option, but I would have to track through the code to see just how much I would need to pull out for all the dependencies of this path to work and, more importantly, to figure out whether there could be side effects of that code running from userspace rather than from within the kernel. In short: complicated. Perhaps too complicated for somebody who has only just started looking at ZFS code for the first time.
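For options 1 and 2, the partition test and the "infer the disk node" step could be sketched roughly as follows. This is not the attached patch. The function names are hypothetical, and the minors-per-disk values are an assumption based on Linux's classic static device numbering, covering only the first SCSI major (8) and the first IDE major (3). Any other major is reported as unknown.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>

/* Assumed minors per disk under Linux's static numbering.
 * Only majors 8 (sd) and 3 (hd) are covered in this sketch. */
static unsigned int minors_per_disk(unsigned int maj)
{
    switch (maj) {
    case 8:  return 16;  /* sda = 0, sdb = 16, ... */
    case 3:  return 64;  /* hda = 0, hdb = 64 */
    default: return 0;   /* layout unknown to this sketch */
    }
}

/* Option 1: returns 1 for a partition, 0 for a whole disk,
 * -1 if the major number is not one of the assumed ones. */
int is_partition(dev_t rdev)
{
    unsigned int per_disk = minors_per_disk(major(rdev));
    if (per_disk == 0)
        return -1;
    return (minor(rdev) % per_disk) != 0;
}

/* Option 2: device number of the disk that contains rdev.
 * Returns rdev unchanged when the layout is unknown. */
dev_t whole_disk(dev_t rdev)
{
    unsigned int per_disk = minors_per_disk(major(rdev));
    if (per_disk == 0)
        return rdev;
    return makedev(major(rdev), minor(rdev) - minor(rdev) % per_disk);
}
```

In flushwc.c the input would presumably come from the st_rdev field of vn->v_stat. A result of 1 or -1 from is_partition() would select the fsync()/BLKFLSBUF path, and 0 would keep the existing SCSI/ATA command path.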

Any thoughts on this? Especially on the part above regarding whether my patch to handle partitions differently (by not issuing a SCSI/ATA command to a partition) will make the cache flushing situation worse instead of better.