
g_vfs_open and bread(devvp, ...)


Andriy Gapon

Mar 17, 2010, 5:52:32 AM
to freeb...@freebsd.org, freebs...@freebsd.org, Bruce Evans

I've taken a fresh look at the issue of g_vfs_open and bread(devvp, ...) in
filesystem code. This time I hope to present my reasoning more clearly than in
my previous attempts. For that reason I am omitting the historical references
and dramatic examples/demonstrations, but they are still available upon request
(and in the archives).
I hope that my shortened notation in references to the code and data structures
will not be confusing.

bread() and the API family it belongs to are an interface to the buffer cache.
The buffer cache can be roughly divided into two parts: a cache part and an I/O
path part.
I think it is understood that both parts must use the same notion of block size
when translating a block number to a byte offset. (If this point is not clear,
I can follow up with another essay.)

In the case of cache code the translation is explicit and simple:
A) offset = blkno * bo_bsize
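
For concreteness, this is essentially what getblk() does when it sets up a new
buffer (condensed and paraphrased from sys/kern/vfs_bio.c, not a verbatim
quote):

	bsize = bo->bo_bsize;
	offset = blkno * bsize;			/* this is translation (A) */
	maxsize = size + (offset & PAGE_MASK);	/* page-level bookkeeping */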

In the I/O code the translation is not as straightforward, because it can be
altered/overridden by bop_strategy, which can in turn hook vop_strategy, and so on.

Let's consider the simple case of a filesystem that:
a) connects to a geom provider via g_vfs_open
b) doesn't modify anything in devvp->v_bufobj (in particular bo_ops and bo_bsize)
c) uses bread(devvp) (e.g., to read an equivalent of a superblock)

A short overview of the geom_vfs glue:
1) g_vfs_open sets devvp->v_bufobj.bo_ops to g_vfs_bufops, where bop_strategy is
g_vfs_strategy
2) bo_bsize is set to pp->sectorsize
3) g_vfs_strategy doesn't perform any block-to-offset translation of its own; it
expects b_iooffset to be correctly set and passes its value on to bio_offset
(see the sketch below)

When a filesystem issues bread(devvp), the following happens in the I/O path:
I) bread() calls breadn()
II) in breadn(): bp->b_iooffset = dbtob(bp->b_blkno), that is, b_iooffset is set to
blkno * DEV_BSIZE (where DEV_BSIZE is 512; see the definitions below)
III) breadn() then calls bstrategy(), which is a simple wrapper around BO_STRATEGY
IV) g_vfs_strategy gets called and, as described in (3) above, simply passes the
b_iooffset value on to bio_offset
V) thus, the block size used for the I/O operation is 512 (DEV_BSIZE)
VI) on the other hand, as stated in (A) above, the block size used in the caching
code is bo_bsize

Thus, if bo_bsize != DEV_BSIZE, or put differently, if pp->sectorsize != 512, we
have a problem: data gets cached at incorrect offsets.
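
A worked example (hypothetical numbers): take a provider with pp->sectorsize ==
2048 and a filesystem that wants the 2048 bytes at byte offset 8192 on the
media. It calls bread(devvp, 16, 2048, ...), with blkno in DEV_BSIZE units
(16 * 512 = 8192), and the I/O path correctly reads from byte 8192. The cache,
however, applies (A) and files the buffer's pages under offset 16 * 2048 =
32768 in devvp's VM object, even though they hold the bytes from media offset
8192. Any later access through the cache that maps offset 32768 will see the
wrong data.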

Additionally, from (V) above we must conclude that a filesystem has to specify
block numbers in DEV_BSIZE units to bread(devvp, blkno, ...) when conditions (a),
(b) and (c) are met. In fact, all such filesystems already do that; otherwise
they would read incorrect data from the media.

So the problem is only with the caching of the data.
As such, this issue has little practical effect, because only a small number of
reads are done via devvp, and only for sufficiently small chunks of data (I hope).
fs/udf used to be greatly affected by this issue when it read directory
nodes via devvp, but that was in the past (prior to 189082).

Still, I think we should fix this issue for general code correctness/quality
reasons, and also to avoid possible future bugs.

As demonstrated by (V) and (VI) above, the obvious and easiest fix is to (always)
set bo_bsize to DEV_BSIZE in g_vfs_open():
--- a/sys/geom/geom_vfs.c
+++ b/sys/geom/geom_vfs.c
@@ -179,7 +179,7 @@ g_vfs_open(struct vnode *vp, struct g_consumer **cpp, const char *fsname, int wr
 	bo = &vp->v_bufobj;
 	bo->bo_ops = g_vfs_bufops;
 	bo->bo_private = cp;
-	bo->bo_bsize = pp->sectorsize;
+	bo->bo_bsize = DEV_BSIZE;
 	gp->softc = bo;

 	return (error);

I tested this change with ufs, udf and cd9660 and haven't observed any regressions.

P.S.
There might be something to changing the bread(devvp) convention so that blkno
is expected in sectorsize units. But setting bo_bsize to sectorsize is only a
tiny portion of what would need to change to actually make that work.

--
Andriy Gapon

Andriy Gapon

Mar 26, 2010, 9:17:41 AM
to freeb...@freebsd.org, freebs...@freebsd.org

Will an offer of a beer help reviewing this change? :-)

Kirk McKusick

Mar 26, 2010, 11:28:10 AM
to Andriy Gapon, freeb...@freebsd.org, freebs...@freebsd.org
I have reviewed your change and I believe that your analysis is
correct. I am in agreement with your making the change.

As disk sector sizes will be growing in the near future, it would
be desirable to get away from having DEV_BSIZE hard-coded. But as
you note, that is a far bigger change than this one.

Kirk McKusick

Jeremy Chadwick

Mar 26, 2010, 12:15:39 PM
to Kirk McKusick, freeb...@freebsd.org, Andriy Gapon, freebs...@freebsd.org

I should note that they already have grown: Western Digital, as of a few
months ago, began shipping drives that use 4KByte sectors. They're
known as the "EARS" drives, due to their model string ending with
"EARS":

WD20EARS: http://www.wdc.com/en/products/Products.asp?DriveID=773
WD15EARS: http://www.wdc.com/en/products/products.asp?driveid=772
WD10EARS: http://www.wdc.com/en/products/products.asp?driveid=763

(I should warn folks that these are Caviar Green drives, which may suffer
from excessive load cycles (parking/unparking of the actuator arm). I don't
have one of these drives, so I can't validate whether the issue happens on
this model or not.)

A discussion and an incredibly cheesy video review are linked below.
The video review does discuss the 4KB sector size, in addition to the
jumper that reverts the drive to using 512-byte sectors for older
OSes such as Windows XP -- and presumably FreeBSD.

http://www.tomshardware.com/reviews/wd-4k-sector,2554.html
http://www.youtube.com/watch?v=QeFj2QTaA3Y

--
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |

Dimitry Andric

Mar 26, 2010, 2:21:20 PM
to Jeremy Chadwick, Kirk McKusick, freeb...@freebsd.org, Andriy Gapon, freebs...@freebsd.org
On 2010-03-26 17:15, Jeremy Chadwick wrote:
> I should note that they already have grown: Western Digital, as of a few
> months ago, began shipping drives that use 4KByte sectors.
...

> A discussion and an incredibly cheesy video review are linked below.
> The video review does discuss the 4KB sector size, in addition to the
> jumper that reverts the drive to using 512-byte sectors for older
> OSes such as Windows XP -- and presumably FreeBSD.

Please note these drives *always* expose 512-byte sectors to any OS, at
least for now.

The jumper you refer to is only a hack to force sector 63 (the usual
starting position for the first partition) to be aligned on a 4096-byte
boundary. If you removed it after partitioning, all sectors would shift
by one, and there would be trouble. :)

Scot Hetzel

Mar 26, 2010, 2:21:41 PM
to Jeremy Chadwick, Kirk McKusick, freeb...@freebsd.org, Andriy Gapon, freebs...@freebsd.org
After reviewing these links, my understanding is that these drives still
present 512-byte sectors to the OS, but internally pack eight 512-byte
sectors into each 4K physical sector on the platters. When the drive needs
to modify a single 512-byte sector, it has to read the entire 4K sector
before writing the change back. If a partition is not aligned on a 4K
boundary, this leads to excessive read-modify-write cycles, which reduces
the performance of these drives. Each partition should therefore be aligned
to start (and end) on a 4K boundary.

The problem with Windows XP is that it always creates the first partition
starting at sector 63, which is not on a 4K boundary. When the jumper is
set, the drive adds 1 to every 512-byte sector request: the OS asks for
sector 63, and the drive returns the contents of sector 64, thus forcing
partitions to start on a 4K boundary.
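
To put numbers on it: sector 63 begins at byte 63 * 512 = 32256, and
32256 / 4096 = 7.875, so it is not 4K-aligned. With the jumper's +1 offset
the same request lands on sector 64, i.e. byte 64 * 512 = 32768 = 8 * 4096,
which is exactly 4K-aligned.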

The other option is to use the WD Align program on Windows XP. This
software re-aligns the partitions and data so that they are aligned to
4K sectors.

In order for FreeBSD to use these drives well, you just need to ensure that
all slices/partitions start on a 4K boundary.

Scot

Andriy Gapon

Mar 26, 2010, 3:10:15 PM
to freeb...@freebsd.org, freebs...@freebsd.org

To no one in particular: guys, I think there are better threads in which to
discuss 4K HDD sectors. This one was started about something only
tangentially related.

--
Andriy Gapon

Andriy Gapon

Apr 1, 2010, 1:38:03 PM
to freeb...@freebsd.org, freebs...@freebsd.org, Kirk McKusick, Kostik Belousov, Poul-Henning Kamp, Bruce Evans

Some bad news.
Apparently I missed the fact that in some places bo_bsize is used as the
sectorsize of the underlying provider. That is, there is code that assumes
that devvp's bo_bsize == sectorsize.

One such place is vnode_pager_generic_getpages() in sys/vm/vnode_pager.c.
That code calls bstrategy() directly on devvp's bufobj, and because of that it
wants to round the read size up to the media sector size (GEOM has a KASSERT
that panics if the I/O size is not a multiple of sectorsize).
Thus the code needs to know the sector size.
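
The relevant round-up looks roughly like this (a sketch paraphrased from
sys/vm/vnode_pager.c, not a verbatim quote; note that it assumes
bo_bsize == sectorsize):

	secmask = bo->bo_bsize - 1;		/* sector size must be a power of 2 */
	size = (size + secmask) & ~secmask;	/* round I/O size up to a sector multiple */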

The other possible place is ffs_rawread.

I am not sure how to get the sectorsize value in those places in a nice way,
i.e. without kludges or violating logical layering.
For example, I could access bo_private->provider->sectorsize, but that seems
too hack-ish.
If anyone knows of a proper way to do this, please let me know.
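
For clarity, the hack-ish route would look something like this (hypothetical
sketch; bo_private on a devvp bufobj is the g_consumer that g_vfs_open()
installed):

	struct g_consumer *cp = bo->bo_private;
	u_int secsize = cp->provider->sectorsize;	/* reaches through two layers */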

For now I am thinking of resorting to a different, slightly less kludgy (IMO)
approach: leave devvp's bo_bsize meaning the provider's sectorsize, but make
getblk() account for devvp's blkno being in units of DEV_BSIZE:
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -2700,7 +2700,7 @@
 		 */
 		if (flags & GB_NOCREAT)
 			return NULL;
-		bsize = bo->bo_bsize;
+		bsize = vn_isdisk(vp, NULL) ? DEV_BSIZE : bo->bo_bsize;
 		offset = blkno * bsize;
 		vmio = vp->v_object != NULL;
 		maxsize = vmio ? size + (offset & PAGE_MASK) : size;

--
Andriy Gapon

Andriy Gapon

Apr 9, 2010, 10:25:59 AM
to freeb...@freebsd.org, freebs...@freebsd.org

I came up with a small demonstration of why I bothered with this issue at all.
To reproduce the experiment you would need this image:
http://people.freebsd.org/~avg/test_img.gz
and avgfs:)
http://people.freebsd.org/~avg/avgfs/

The image needs to be gunzip-ed, of course.
avgfs:) needs to be compiled and avgfs.ko loaded.
The demonstration is to be executed on an unpatched system, e.g. head before
r205860, or a branch other than head, or recent head with r206097 and r205860
reverted.

Then do the following.
I. Inspect test image.
$ hd test_img.img | more

II. Test bread() through avgfs vnode.
1. Present the image as a disk with 2K sector size
mdconfig -a -t vnode -f test_img.img -S 2048 -u 0
2. Mount the image using avgfs:)
mount -t avg /dev/md0 /mnt
3. Read some data blocks in one go and examine the result.
dd if=/mnt/thefile bs=10k count=1 | hd
4. Re-read the same data using 2K blocks and examine the result.
dd if=/mnt/thefile bs=2k count=5 | hd
5. Cleanup.
umount /mnt

III. Test bread() through devvp.
1. Mount the image using avgfs:) with devvp option.
mount -t avg -o devvp /dev/md0 /mnt
2. Read some data blocks in one go and examine the result.
dd if=/mnt/thefile bs=10k count=1 | hd
3. Re-read the same data using 2K blocks and examine the result.
dd if=/mnt/thefile bs=2k count=5 | hd
4. Cleanup.
umount /mnt
mdconfig -d -u 0
kldunload avgfs


SPOILER.
In my testing only III.3 produces an unexpected result:
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000800 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 |................|
*
00001000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 |................|
*
00001800 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................|
*
00002000 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 04 |................|
*

The result is explained here:
http://lists.freebsd.org/pipermail/freebsd-fs/2008-February/004268.html

Repeat the experiment on a patched system and see whether the change was worth
the bother.

P.S. avgfs:) is explained here:
http://permalink.gmane.org/gmane.os.freebsd.devel.file-systems/8886

--
Andriy Gapon
