[ANNOUNCE] Btrfs v0.12 released

Chris Mason

unread,

Feb 6, 2008, 12:01:44 PM2/6/08

to linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

Hello everyone,

I wasn't planning on releasing v0.12 yet, and it was supposed to have some
initial support for multiple devices. But, I have made a number of
performance fixes and small bug fixes, and I wanted to get them out there
before the (destabilizing) work on multiple-devices took over.

So, here's v0.12. It comes with a shiny new disk format (sorry), but the gain
is dramatically better random writes to existing files. In testing here, the
random write phase of tiobench went from 1MB/s to 30MB/s. The fix was to
change the way back references for file extents were hashed.

Other changes:

Insert and delete multiple items at once in the btree where possible. Back
references added more tree balances, and it showed up in a few benchmarks.
With v0.12, backrefs have no real impact on performance.

Optimize bio end_io routines. Btrfs was spending way too much CPU time in the
bio end_io routines, leading to lock contention and other problems.

Optimize read ahead during transaction commit. The old code was trying to
read far too much at once, which made the end_io problems really stand out.

mount -o ssd option, which clusters file data writes together regardless of
the directory the files belong to. There are a number of other performance
tweaks for SSD, aimed at clustering metadata and data writes to better take
advantage of the hardware.

mount -o max_inline=size option, to override the default max inline file data
size (default is 8k). Any value up to the leaf size is allowed (default
16k).

Simple -ENOSPC handling. Emphasis on simple, but it prevents accidentally
filling the disk most of the time. With enough threads/procs banging on
things, you can still easily crash the box.

-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

David Miller

unread,

Feb 10, 2008, 8:12:58 PM2/10/08

to chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Chris Mason <chris...@oracle.com>
Date: Wed, 6 Feb 2008 12:00:13 -0500

> So, here's v0.12.

I couldn't even make a filesystem on sparc64 without the following
patch.

The first problem is that these SETGET macros lose typing information,
and therefore can't see the 'packed' attribute and therefore take
unaligned access SIGBUS signals on sparc64 when trying to derefernce
the member.

The next problem is a similar issue in btrfs_name_hash(). This gets
passed things like &key.offset which is a member of a packed
structure, losing this packed'ness information btrfs_name_hash()
performs a potentially unaligned memory access, again resulting in a
SIGBUS.

This function never returns an error, so the simplest fix was to
return the hash value which avoids all of the issues. In attempting
other schemes to fix this, I found it very difficult to give gcc
a packed attribute for that "u64 *" argument other than to create
some new pseudo structure which would have been ugly.

Similar code lives in the btrfs kernel code too, I'll try to get a
partition at least mounted and working minimally and if successful
I'll send you patches for that too.

diff -up --recursive --new-file vanilla/btrfs-progs-0.12/ctree.h btrfs-progs-0.12/ctree.h
--- vanilla/btrfs-progs-0.12/ctree.h 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/ctree.h 2008-02-10 16:53:24.000000000 -0800
@@ -451,18 +451,16 @@ static inline void btrfs_set_##name(stru
static inline u##bits btrfs_##name(struct extent_buffer *eb, \
type *s) \
{ \
- unsigned long offset = (unsigned long)s + \
- offsetof(type, member); \
- __le##bits *tmp = (__le##bits *)(eb->data + offset); \
- return le##bits##_to_cpu(*tmp); \
+ unsigned long offset = (unsigned long)s; \
+ type *p = (type *) (eb->data + offset); \
+ return le##bits##_to_cpu(p->member); \
} \
static inline void btrfs_set_##name(struct extent_buffer *eb, \
type *s, u##bits val) \
{ \
- unsigned long offset = (unsigned long)s + \
- offsetof(type, member); \
- __le##bits *tmp = (__le##bits *)(eb->data + offset); \
- *tmp = cpu_to_le##bits(val); \
+ unsigned long offset = (unsigned long)s; \
+ type *p = (type *) (eb->data + offset); \
+ p->member = cpu_to_le##bits(val); \
}

#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/dir-item.c btrfs-progs-0.12/dir-item.c
--- vanilla/btrfs-progs-0.12/dir-item.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/dir-item.c 2008-02-10 17:03:34.000000000 -0800
@@ -71,8 +71,7 @@ int btrfs_insert_xattr_item(struct btrfs

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
@@ -122,8 +121,7 @@ int btrfs_insert_dir_item(struct btrfs_t

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);
path = btrfs_alloc_path();
data_size = sizeof(*dir_item) + name_len;
dir_item = insert_with_overflow(trans, root, path, &key, data_size,
@@ -196,8 +194,7 @@ struct btrfs_dir_item *btrfs_lookup_dir_
key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);

- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);

ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
if (ret < 0)
@@ -258,8 +255,7 @@ struct btrfs_dir_item *btrfs_lookup_xatt

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);
ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
if (ret < 0)
return ERR_PTR(ret);
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/dir-test.c btrfs-progs-0.12/dir-test.c
--- vanilla/btrfs-progs-0.12/dir-test.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/dir-test.c 2008-02-10 17:03:49.000000000 -0800
@@ -129,8 +129,8 @@ error:
struct btrfs_dir_item);
found = (char *)(di + 1);
found_len = btrfs_dir_name_len(di);
- btrfs_name_hash(buf, strlen(buf), &myhash);
- btrfs_name_hash(found, found_len, &foundhash);
+ myhash = btrfs_name_hash(buf, strlen(buf));
+ foundhash = btrfs_name_hash(found, found_len);
if (myhash != foundhash)
goto fatal_release;
btrfs_release_path(root, &path);
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/hash.c btrfs-progs-0.12/hash.c
--- vanilla/btrfs-progs-0.12/hash.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/hash.c 2008-02-10 17:02:36.000000000 -0800
@@ -75,12 +75,13 @@ static void str2hashbuf(const char *msg,
*buf++ = pad;
}

-int btrfs_name_hash(const char *name, int len, u64 *hash_result)
+u64 btrfs_name_hash(const char *name, int len)
{
__u32 hash;
__u32 minor_hash = 0;
const char *p;
__u32 in[8], buf[2];
+ u64 hash_result;

/* Initialize the default seed for the hash checksum functions */
buf[0] = 0x67452301;
@@ -97,8 +98,8 @@ int btrfs_name_hash(const char *name, in
}
hash = buf[0];
minor_hash = buf[1];
- *hash_result = buf[0];
- *hash_result <<= 32;
- *hash_result |= buf[1];
- return 0;
+ hash_result = buf[0];
+ hash_result <<= 32;
+ hash_result |= buf[1];
+ return hash_result;
}
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/hash.h btrfs-progs-0.12/hash.h
--- vanilla/btrfs-progs-0.12/hash.h 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/hash.h 2008-02-10 17:02:21.000000000 -0800
@@ -18,5 +18,5 @@

#ifndef __HASH__
#define __HASH__
-int btrfs_name_hash(const char *name, int len, u64 *hash_result);
+u64 btrfs_name_hash(const char *name, int len);
#endif
diff -up --recursive --new-file vanilla/btrfs-progs-0.12/hasher.c btrfs-progs-0.12/hasher.c
--- vanilla/btrfs-progs-0.12/hasher.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/hasher.c 2008-02-10 17:04:04.000000000 -0800
@@ -35,8 +35,7 @@ int main() {
continue;
if (line[strlen(line)-1] == '\n')
line[strlen(line)-1] = '\0';
- ret = btrfs_name_hash(line, strlen(line), &result);
- BUG_ON(ret);
+ result = btrfs_name_hash(line, strlen(line));
printf("hash returns %llu\n", (unsigned long long)result);
}
return 0;

Chris Mason

unread,

Feb 11, 2008, 8:43:48 AM2/11/08

to David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Sunday 10 February 2008, David Miller wrote:
> From: Chris Mason <chris...@oracle.com>
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>

> This function never returns an error, so the simplest fix was to
> return the hash value which avoids all of the issues. In attempting
> other schemes to fix this, I found it very difficult to give gcc
> a packed attribute for that "u64 *" argument other than to create
> some new pseudo structure which would have been ugly.
>

Many thanks, I clearly didn't put enough thought into the unaligned access
problems.

> Similar code lives in the btrfs kernel code too, I'll try to get a
> partition at least mounted and working minimally and if successful
> I'll send you patches for that too.

The kernel is actually worse, because the set/get macros are more complex.
Some live in ctree.h like in the progs, but the nasty ones live in
struct-funcs.c

-chris

David Miller

unread,

Feb 12, 2008, 1:43:36 AM2/12/08

to chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Chris Mason <chris...@oracle.com>
Date: Mon, 11 Feb 2008 08:42:20 -0500

> The kernel is actually worse, because the set/get macros are more complex.
> Some live in ctree.h like in the progs, but the nasty ones live in
> struct-funcs.c

This is really problematic, because you've got these things called
"btrfs_item_ptr()" which really isn't a pointer, it's a relative
'unsigned long' offset cast to a pointer. The source of this
seems to be btrfs_leaf_data().

And then those things get passed down into the SETGET functions!

Then deeper down we have terribly inconsistent things like
btrfs_item_nr_offset() and
btrfs_item_offset_nr().

Sigh... I'll see what I can do.

David Miller

unread,

Feb 12, 2008, 3:11:14 AM2/12/08

to chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: David Miller <da...@davemloft.net>
Date: Mon, 11 Feb 2008 23:21:39 -0800 (PST)

> Filesystems like ext2 put their superblock 1 block into the partition
> in order to avoid overwriting disk labels and other uglies. UFS does
> this too, as do several others. One of the few exceptions I've been
> able to find is XFS.
>
> This is a real issue on sparc where the default sun disk labels
> created use an initial partition where block zero aliases the disk
> label. It took me a few iterations before I figured out why every
> btrfs make would zero out my disk label :-/

Actually it seems this is only a problem with mkfs.btrfs, it clears
out the first 64 4K chunks of the disk for whatever reason.

The following patch cures the disk label spamming problem for me:

--- vanilla/btrfs-progs-0.12/mkfs.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/mkfs.c 2008-02-12 00:07:43.000000000 -0800
@@ -210,7 +210,8 @@ int main(int ac, char **av)
exit(1);
}
memset(buf, 0, sectorsize);
- for(i = 0; i < 64; i++) {
+ lseek(fd, BTRFS_SUPER_INFO_OFFSET, SEEK_SET);
+ for(i = BTRFS_SUPER_INFO_OFFSET / sectorsize; i < 64; i++) {
ret = write(fd, buf, sectorsize);
if (ret != sectorsize) {
fprintf(stderr, "unable to zero fill device\n");

David Miller

unread,

Feb 12, 2008, 4:24:08 AM2/12/08

to chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

The CRC32C implementation in the btrfs progs is different from the one
in the kernel, so obviously nothing can possibly work on big-endian.

This is getting less and less fun by the minute, I simply wanted to
test btrfs on Niagara :-/

Here is a patch to fix that:

--- vanilla/btrfs-progs-0.12/crc32c.c 2008-02-06 08:37:45.000000000 -0800
+++ btrfs-progs-0.12/crc32c.c 2008-02-12 01:19:33.000000000 -0800
@@ -91,13 +91,11 @@ static const u32 crc32c_table[256] = {
* crc using table.
*/

-u32 crc32c_le(u32 seed, unsigned char const *data, size_t length)
+u32 crc32c_le(u32 crc, unsigned char const *data, size_t length)
{
- u32 crc = (__force __u32)(cpu_to_le32(seed));
-
while (length--)
crc =
crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);

- return le32_to_cpu((__force __le32)crc);
+ return crc;

Chris Mason

unread,

Feb 12, 2008, 8:45:16 AM2/12/08

to David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason <chris...@oracle.com>
> Date: Mon, 11 Feb 2008 08:42:20 -0500
>
> > The kernel is actually worse, because the set/get macros are more
> > complex. Some live in ctree.h like in the progs, but the nasty ones live
> > in struct-funcs.c
>
> This is really problematic, because you've got these things called
> "btrfs_item_ptr()" which really isn't a pointer, it's a relative
> 'unsigned long' offset cast to a pointer. The source of this
> seems to be btrfs_leaf_data().
>
> And then those things get passed down into the SETGET functions!

Explaining it won't make it pretty, but at least I can tell you what the code
does.

This is all part of the btrfs code that supports tree block sizes larger than
a page. The extent_buffer code (extent_io.c) provides a read/write api into
an extent_buffer based on offsets from the start of the multi-page buffer.
That's where the relative unsigned long comes from.

The part where I cast it to pointers is me trying to maintain type checking
throughout the code. The pointers themselves are useless, they need to be
matched with an extent_buffer to actually get to the bytes.

There are a few parts where the SETGET funcs are open coded, mostly in very
performance critical functions. Just look for lexxx_to_cpu

>
> Then deeper down we have terribly inconsistent things like
> btrfs_item_nr_offset() and
> btrfs_item_offset_nr().

Btree blocks have the offset of the item header from the start of the block
and the offset of the item data. And, I'm very bad at naming.

>
> Sigh... I'll see what I can do.

Thanks

-chris

Chris Mason

unread,

Feb 12, 2008, 8:51:00 AM2/12/08

to David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tuesday 12 February 2008, David Miller wrote:

> From: David Miller <da...@davemloft.net>
> Date: Mon, 11 Feb 2008 23:21:39 -0800 (PST)
>
> > Filesystems like ext2 put their superblock 1 block into the partition
> > in order to avoid overwriting disk labels and other uglies. UFS does
> > this too, as do several others. One of the few exceptions I've been
> > able to find is XFS.
> >
> > This is a real issue on sparc where the default sun disk labels
> > created use an initial partition where block zero aliases the disk
> > label. It took me a few iterations before I figured out why every
> > btrfs make would zero out my disk label :-/
>
> Actually it seems this is only a problem with mkfs.btrfs, it clears
> out the first 64 4K chunks of the disk for whatever reason.

It is a good idea to remove supers from other filesystems. I also need to add
zeroing at the end of the device as well.

Looks like I misread the e2fs zeroing code. It zeros the whole external log
device, and I assumed it also zero'd out the start of the main FS.

So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

-chris

Jan Engelhardt

unread,

Feb 12, 2008, 9:01:14 AM2/12/08

to Chris Mason, David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Feb 12 2008 08:49, Chris Mason wrote:
>> >
>> > This is a real issue on sparc where the default sun disk labels
>> > created use an initial partition where block zero aliases the disk
>> > label. It took me a few iterations before I figured out why every
>> > btrfs make would zero out my disk label :-/
>>
>> Actually it seems this is only a problem with mkfs.btrfs, it clears
>> out the first 64 4K chunks of the disk for whatever reason.
>
>It is a good idea to remove supers from other filesystems. I also need to add
>zeroing at the end of the device as well.
>
>Looks like I misread the e2fs zeroing code. It zeros the whole external log
>device, and I assumed it also zero'd out the start of the main FS.
>
>So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

Something looks wrong here. Why would btrfs need to zero at all?
Superblock at 0, and done. Just like xfs.
(Yes, I had xfs on sparc before, so it's not like you NEED the
whitespace at the start of a partition.)

Chris Mason

unread,

Feb 12, 2008, 9:10:24 AM2/12/08

to Jan Engelhardt, David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tuesday 12 February 2008, Jan Engelhardt wrote:
> On Feb 12 2008 08:49, Chris Mason wrote:
> >> > This is a real issue on sparc where the default sun disk labels
> >> > created use an initial partition where block zero aliases the disk
> >> > label. It took me a few iterations before I figured out why every
> >> > btrfs make would zero out my disk label :-/
> >>
> >> Actually it seems this is only a problem with mkfs.btrfs, it clears
> >> out the first 64 4K chunks of the disk for whatever reason.
> >
> >It is a good idea to remove supers from other filesystems. I also need to
> > add zeroing at the end of the device as well.
> >
> >Looks like I misread the e2fs zeroing code. It zeros the whole external
> > log device, and I assumed it also zero'd out the start of the main FS.
> >
> >So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>
> Something looks wrong here. Why would btrfs need to zero at all?
> Superblock at 0, and done. Just like xfs.
> (Yes, I had xfs on sparc before, so it's not like you NEED the
> whitespace at the start of a partition.)

I've had requests to move the super down to 64k to make room for bootloaders,
which may not matter for sparc, but I don't really plan on different
locations for different arches.

4k aligned is important given that sector sizes are growing.

-chris

Jan Engelhardt

unread,

Feb 12, 2008, 9:22:37 AM2/12/08

to Chris Mason, David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Feb 12 2008 09:08, Chris Mason wrote:
>> >
>> >So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
>>
>> Something looks wrong here. Why would btrfs need to zero at all?
>> Superblock at 0, and done. Just like xfs.
>> (Yes, I had xfs on sparc before, so it's not like you NEED the
>> whitespace at the start of a partition.)
>
>I've had requests to move the super down to 64k to make room for bootloaders,
>which may not matter for sparc, but I don't really plan on different
>locations for different arches.

In x86, there is even more space for a bootloader (some 28k or so)
even if your partition table is as closely packed as possible,
from 0 to 7e00 IIRC.

For sparc you could have something like

startlba endlba type
sda1 0 2 1 Boot
sda2 2 58 3 Whole disk
sda3 58 90000 83 Linux

and slap the bootloader into "MBR", just like on x86.
Or I am missing something..

Chris Mason

unread,

Feb 12, 2008, 9:36:46 AM2/12/08

to Jan Engelhardt, David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tuesday 12 February 2008, Jan Engelhardt wrote:

> On Feb 12 2008 09:08, Chris Mason wrote:
> >> >So, if Btrfs starts zeroing at 1k, will that be acceptable for you?
> >>
> >> Something looks wrong here. Why would btrfs need to zero at all?
> >> Superblock at 0, and done. Just like xfs.
> >> (Yes, I had xfs on sparc before, so it's not like you NEED the
> >> whitespace at the start of a partition.)
> >
> >I've had requests to move the super down to 64k to make room for
> > bootloaders, which may not matter for sparc, but I don't really plan on
> > different locations for different arches.
>
> In x86, there is even more space for a bootloader (some 28k or so)
> even if your partition table is as closely packed as possible,
> from 0 to 7e00 IIRC.
>
> For sparc you could have something like
>
> startlba endlba type
> sda1 0 2 1 Boot
> sda2 2 58 3 Whole disk
> sda3 58 90000 83 Linux
>
> and slap the bootloader into "MBR", just like on x86.
> Or I am missing something..

It was a request from hpa, and he clearly had something in mind. He kindly
offered to review the disk format for bootloaders and other lower level
issues but I asked him to wait until I firm it up a bit.

From my point of view, 0 is a bad idea because it is very likely to conflict
with other things. There are lots of things in the FS that need deep
thought,and the perfect system to fully use the first 64k of a 1TB filesystem
isn't quite at the top of my list right now ;)

Regardless of offset, it is a good idea to mop up previous filesystems where
possible, and a very good idea to align things on some sector boundary. Even
going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

-chris

Jan Engelhardt

unread,

Feb 12, 2008, 10:05:30 AM2/12/08

to Chris Mason, David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

>
On Feb 12 2008 09:35, Chris Mason wrote:
>>
>> and slap the bootloader into "MBR", just like on x86.
>> Or I am missing something..
>
>It was a request from hpa, and he clearly had something in mind. He kindly
>offered to review the disk format for bootloaders and other lower level
>issues but I asked him to wait until I firm it up a bit.
>
>From my point of view, 0 is a bad idea because it is very likely to conflict
>with other things. There are lots of things in the FS that need deep
>thought,and the perfect system to fully use the first 64k of a 1TB filesystem
>isn't quite at the top of my list right now ;)
>
>Regardless of offset, it is a good idea to mop up previous filesystems where
>possible, and a very good idea to align things on some sector boundary. Even
>going 1MB in wouldn't be a horrible idea to align with erasure blocks on SSD.

I still don't like the idea of btrfs trying to be smarter than a user
who can partition up his system according to
(a) his likes
(b) system or hardware requirements or recommendations
to align the superblock to a specific location.

1MB alignment does not always mean 1MB alignment.
Sector 1 begins at 0x7e00 on x86.
And with the maximum CHS geometry (255/63), partitions begin
at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on
a 1048576 boundary.

Chris Mason

unread,

Feb 12, 2008, 11:19:34 AM2/12/08

to Jan Engelhardt, David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tuesday 12 February 2008, Jan Engelhardt wrote:

> On Feb 12 2008 09:35, Chris Mason wrote:
> >> and slap the bootloader into "MBR", just like on x86.
> >> Or I am missing something..
> >
> >It was a request from hpa, and he clearly had something in mind. He
> > kindly offered to review the disk format for bootloaders and other lower
> > level issues but I asked him to wait until I firm it up a bit.
> >
> >From my point of view, 0 is a bad idea because it is very likely to
> > conflict with other things. There are lots of things in the FS that need
> > deep thought,and the perfect system to fully use the first 64k of a 1TB
> > filesystem isn't quite at the top of my list right now ;)
> >
> >Regardless of offset, it is a good idea to mop up previous filesystems
> > where possible, and a very good idea to align things on some sector
> > boundary. Even going 1MB in wouldn't be a horrible idea to align with
> > erasure blocks on SSD.
>
> I still don't like the idea of btrfs trying to be smarter than a user
> who can partition up his system according to
> (a) his likes
> (b) system or hardware requirements or recommendations
> to align the superblock to a specific location.

Will all the users in the world who think about super block location when they
partition their disks please raise their hands?

The location of the super block needs to be very simple in order for mount and
friends to find and detect it. It needs a simple algorithm to try multiple
locations in case a given copy of the super is corrupt.

Design in this case is a bunch of compromises around other users of the
hardware, ease of programming, and the benefits in performance or usability
from doing something complex.

>
> 1MB alignment does not always mean 1MB alignment.
> Sector 1 begins at 0x7e00 on x86.
> And with the maximum CHS geometry (255/63), partitions begin
> at 0x7e00+n*8225280 bytes, so the SB is unlikely to ever be on
> a 1048576 boundary.

IO is already aligned on sectors, sometimes we'll have a perfect erasure block
alignment and sometimes not. When the location of the super is my biggest
bottleneck, I'll be a very happy boy.

-chris

David Miller

unread,

Feb 12, 2008, 3:50:32 PM2/12/08

to chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Chris Mason <chris...@oracle.com>
Date: Tue, 12 Feb 2008 08:49:34 -0500

> So, if Btrfs starts zeroing at 1k, will that be acceptable for you?

Sure.

David Miller

unread,

Feb 12, 2008, 4:55:46 PM2/12/08

to chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Chris Mason <chris...@oracle.com>

Date: Wed, 6 Feb 2008 12:00:13 -0500

> So, here's v0.12.

Any page size larger than 4K will not work with btrfs. All of the
extent stuff assumes that PAGE_SIZE <= sectorsize.

I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
sparc64 and I was finally able to successfully mount a partition.

With 4K there are zero's in the root tree node header, because it's
extent's location on disk is at a sub-PAGE_SIZE multiple and the
extent code doesn't handle that.

You really need to start validating this stuff on other platforms.
Something that isn't little endian and something that doesn't use 4K
pages. I'm sure you have some powerpc parts around somewhere. :)

Anyways, here is a patch for the kernel bits which fixes most of the
unaligned accesses on sparc64.

diff -u --recursive --new-file vanilla/btrfs-0.12/ctree.h btrfs-0.12/ctree.h
--- vanilla/btrfs-0.12/ctree.h 2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/ctree.h 2008-02-10 17:17:49.000000000 -0800
@@ -495,22 +495,17 @@
#define BTRFS_SETGET_HEADER_FUNCS(name, type, member, bits) \
static inline u##bits btrfs_##name(struct extent_buffer *eb) \
{ \
- char *kaddr = kmap_atomic(eb->first_page, KM_USER0); \
- unsigned long offset = offsetof(type, member); \
- u##bits res; \
- __le##bits *tmp = (__le##bits *)(kaddr + offset); \
- res = le##bits##_to_cpu(*tmp); \
- kunmap_atomic(kaddr, KM_USER0); \
+ type *p = kmap_atomic(eb->first_page, KM_USER0); \
+ u##bits res = le##bits##_to_cpu(p->member); \
+ kunmap_atomic(p, KM_USER0); \
return res; \

} \
static inline void btrfs_set_##name(struct extent_buffer *eb, \

u##bits val) \
{ \
- char *kaddr = kmap_atomic(eb->first_page, KM_USER0); \
- unsigned long offset = offsetof(type, member); \
- __le##bits *tmp = (__le##bits *)(kaddr + offset); \
- *tmp = cpu_to_le##bits(val); \
- kunmap_atomic(kaddr, KM_USER0); \
+ type *p = kmap_atomic(eb->first_page, KM_USER0); \

+ p->member = cpu_to_le##bits(val); \

+ kunmap_atomic(p, KM_USER0); \

}

#define BTRFS_SETGET_STACK_FUNCS(name, type, member, bits) \

diff -u --recursive --new-file vanilla/btrfs-0.12/dir-item.c btrfs-0.12/dir-item.c
--- vanilla/btrfs-0.12/dir-item.c 2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/dir-item.c 2008-02-10 17:20:00.000000000 -0800
@@ -71,8 +71,7 @@

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;

@@ -125,8 +124,7 @@

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);
path = btrfs_alloc_path();
data_size = sizeof(*dir_item) + name_len;
dir_item = insert_with_overflow(trans, root, path, &key, data_size,

@@ -199,8 +197,7 @@

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);

- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);

ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
if (ret < 0)

@@ -261,8 +258,7 @@

key.objectid = dir;
btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
- ret = btrfs_name_hash(name, name_len, &key.offset);
- BUG_ON(ret);
+ key.offset = btrfs_name_hash(name, name_len);
ret = btrfs_search_slot(trans, root, &key, path, ins_len, cow);
if (ret < 0)
return ERR_PTR(ret);

diff -u --recursive --new-file vanilla/btrfs-0.12/hash.c btrfs-0.12/hash.c
--- vanilla/btrfs-0.12/hash.c 2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/hash.c 2008-02-10 17:19:19.000000000 -0800
@@ -76,19 +76,18 @@

*buf++ = pad;
}

-int btrfs_name_hash(const char *name, int len, u64 *hash_result)
+u64 btrfs_name_hash(const char *name, int len)
{
__u32 hash;
__u32 minor_hash = 0;
const char *p;
__u32 in[8], buf[2];
+ u64 hash_result;

if (len == 1 && *name == '.') {
- *hash_result = 1;
- return 0;
+ return 1;
} else if (len == 2 && name[0] == '.' && name[1] == '.') {
- *hash_result = 2;
- return 0;
+ return 2;

}

/* Initialize the default seed for the hash checksum functions */

@@ -106,8 +105,8 @@

}
hash = buf[0];
minor_hash = buf[1];
- *hash_result = buf[0];
- *hash_result <<= 32;
- *hash_result |= buf[1];
- return 0;
+ hash_result = buf[0];
+ hash_result <<= 32;
+ hash_result |= buf[1];
+ return hash_result;
}

diff -u --recursive --new-file vanilla/btrfs-0.12/hash.h btrfs-0.12/hash.h
--- vanilla/btrfs-0.12/hash.h 2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/hash.h 2008-02-10 17:19:25.000000000 -0800

@@ -18,5 +18,5 @@

#ifndef __HASH__
#define __HASH__
-int btrfs_name_hash(const char *name, int len, u64 *hash_result);
+u64 btrfs_name_hash(const char *name, int len);
#endif

diff -u --recursive --new-file vanilla/btrfs-0.12/struct-funcs.c btrfs-0.12/struct-funcs.c
--- vanilla/btrfs-0.12/struct-funcs.c 2008-02-06 08:37:39.000000000 -0800
+++ btrfs-0.12/struct-funcs.c 2008-02-11 22:50:46.000000000 -0800
@@ -21,16 +21,15 @@

u##bits btrfs_##name(struct extent_buffer *eb, \
type *s) \
{ \
- unsigned long offset = (unsigned long)s + \
- offsetof(type, member); \

- __le##bits *tmp; \
+ unsigned long part_offset = (unsigned long)s; \
+ unsigned long offset = part_offset + offsetof(type, member); \
+ type *p; \
/* ugly, but we want the fast path here */ \
if (eb->map_token && offset >= eb->map_start && \
offset + sizeof(((type *)0)->member) <= eb->map_start + \
eb->map_len) { \
- tmp = (__le##bits *)(eb->kaddr + offset - \
- eb->map_start); \
- return le##bits##_to_cpu(*tmp); \
+ p = (type *)(eb->kaddr + part_offset - eb->map_start); \
+ return le##bits##_to_cpu(p->member); \
} \
{ \
int err; \
@@ -48,8 +47,8 @@
read_eb_member(eb, s, type, member, &res); \
return le##bits##_to_cpu(res); \
} \
- tmp = (__le##bits *)(kaddr + offset - map_start); \
- res = le##bits##_to_cpu(*tmp); \
+ p = (type *)(kaddr + part_offset - map_start); \
+ res = le##bits##_to_cpu(p->member); \
if (unmap_on_exit) \
unmap_extent_buffer(eb, map_token, KM_USER1); \
return res; \
@@ -58,16 +57,15 @@

void btrfs_set_##name(struct extent_buffer *eb, \
type *s, u##bits val) \
{ \
- unsigned long offset = (unsigned long)s + \
- offsetof(type, member); \

- __le##bits *tmp; \
+ unsigned long part_offset = (unsigned long)s; \
+ unsigned long offset = part_offset + offsetof(type, member); \
+ type *p; \
/* ugly, but we want the fast path here */ \
if (eb->map_token && offset >= eb->map_start && \
offset + sizeof(((type *)0)->member) <= eb->map_start + \
eb->map_len) { \
- tmp = (__le##bits *)(eb->kaddr + offset - \
- eb->map_start); \
- *tmp = cpu_to_le##bits(val); \
+ p = (type *)(eb->kaddr + part_offset - eb->map_start); \

+ p->member = cpu_to_le##bits(val); \

return; \
} \
{ \
@@ -86,8 +84,8 @@
write_eb_member(eb, s, type, member, &val); \
return; \
} \
- tmp = (__le##bits *)(kaddr + offset - map_start); \
- *tmp = cpu_to_le##bits(val); \
+ p = (type *)(kaddr + part_offset - map_start); \

+ p->member = cpu_to_le##bits(val); \

if (unmap_on_exit) \
unmap_extent_buffer(eb, map_token, KM_USER1); \
} \

Chris Mason

unread,

Feb 12, 2008, 5:06:16 PM2/12/08

to David Miller, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tuesday 12 February 2008, David Miller wrote:
> From: Chris Mason <chris...@oracle.com>
> Date: Wed, 6 Feb 2008 12:00:13 -0500
>
> > So, here's v0.12.
>
> Any page size larger than 4K will not work with btrfs. All of the
> extent stuff assumes that PAGE_SIZE <= sectorsize.

Yeah, there is definitely clean up to do in that area.

>
> I confirmed this by forcing mkfs.btrfs to use an 8K sectorsize on
> sparc64 and I was finally able to successfully mount a partition.

Nice

>
> With 4K there are zero's in the root tree node header, because it's
> extent's location on disk is at a sub-PAGE_SIZE multiple and the
> extent code doesn't handle that.
>
> You really need to start validating this stuff on other platforms.
> Something that isn't little endian and something that doesn't use 4K
> pages. I'm sure you have some powerpc parts around somewhere. :)

Grin, I think around v0.4 I grabbed a ppc box for a day and got things
working. There has been some churn since then...

My first prio is the newest set of disk format changes, and then I'll sit down
and work on stability on a bunch of arches.

>
> Anyways, here is a patch for the kernel bits which fixes most of the
> unaligned accesses on sparc64.

Many thanks, I'll try these out here and push them into the tree.

-chris

David Miller

unread,

Feb 12, 2008, 6:26:47 PM2/12/08

to jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Jan Engelhardt <jen...@computergmbh.de>
Date: Tue, 12 Feb 2008 15:00:20 +0100 (CET)

> (Yes, I had xfs on sparc before, so it's not like you NEED the
> whitespace at the start of a partition.)

You actully do unless you want to lose significant chunks of your disk
space.

The Sun disk label only allows you to specify the start of a partition
in cylinders, so if you want to use a filesystem like XFS you have to
start the partition on cylinder 1 which can be many blocks into the
disk. That entire first cylinder is completely wasted.

What XFS does by putting the superblock at zero is simply does not
take these kinds of issues into consideration.

David Miller

unread,

Feb 12, 2008, 6:28:29 PM2/12/08

to jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Jan Engelhardt <jen...@computergmbh.de>
Date: Tue, 12 Feb 2008 15:00:20 +0100 (CET)

> Something looks wrong here. Why would btrfs need to zero at all?

So that existing superblocks on the partition won't
be interpreted as correct by other filesystems. It's
a safety measure many mkfs programs use.

> Superblock at 0, and done. Just like xfs.

No, we won't do stupid things like that and make an entire
cylinder of our disks unusable. See my other reply.

David Miller

unread,

Feb 12, 2008, 6:34:54 PM2/12/08

to jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Jan Engelhardt <jen...@computergmbh.de>
Date: Tue, 12 Feb 2008 15:21:52 +0100 (CET)

> For sparc you could have something like
>
> startlba endlba type
> sda1 0 2 1 Boot
> sda2 2 58 3 Whole disk
> sda3 58 90000 83 Linux
>
> and slap the bootloader into "MBR", just like on x86.
> Or I am missing something..

You cannot specify partitions on LBA boundaries on sparc, the Sun disk
label specifies the partition start points in cylinders.

David Miller

unread,

Feb 12, 2008, 6:35:55 PM2/12/08

to chris...@oracle.com, jen...@computergmbh.de, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Chris Mason <chris...@oracle.com>
Date: Tue, 12 Feb 2008 09:35:20 -0500

> From my point of view, 0 is a bad idea because it is very likely to
> conflict with other things.

Starting at 0 is a bad idea because otherwise you'll waste
significant chunks of your disk on Sparc because of reasons
I've outlined in other replies.

What XFS does is really unfortunate, let's learn from it's
mistake.

David Miller

unread,

Feb 12, 2008, 6:38:34 PM2/12/08

to jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Jan Engelhardt <jen...@computergmbh.de>
Date: Tue, 12 Feb 2008 16:04:52 +0100 (CET)

> I still don't like the idea of btrfs trying to be smarter than a user
> who can partition up his system according to
> (a) his likes
> (b) system or hardware requirements or recommendations
> to align the superblock to a specific location.

All of your beliefs are unfortunately without the understanding
of restrictions that exist in several partition layouts such
as the Sun disk label one.

You have to start the superblock somewhere other than zero or
else you lose a huge chunk of your disk, and furthermore a
zero based partition is what all of the Sun disk label
creating programs make by default.

"I make a default disk label, I put btrfs or XFS on there, my disk
label is gone."

Real intuitive.

Jan Engelhardt

unread,

Feb 12, 2008, 6:39:44 PM2/12/08

to David Miller, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Feb 12 2008 15:26, David Miller wrote:
>
>> (Yes, I had xfs on sparc before, so it's not like you NEED the
>> whitespace at the start of a partition.)
>
>You actully do unless you want to lose significant chunks of your disk
>space.
>
>The Sun disk label only allows you to specify the start of a partition
>in cylinders, so if you want to use a filesystem like XFS you have to
>start the partition on cylinder 1 which can be many blocks into the
>disk. That entire first cylinder is completely wasted.

Ok you do have a point there. The GPT users win of course, since it
uses LBA, not cyls, so the number of lost bytes is generally below
a cyl.
On the other hand, the H and S of CHS could be lowered and S increased,
e.g. divide H by 2, divide S by 2, multiply S by 4. This gives a finer
bytes/cylinder granularity.

>What XFS does by putting the superblock at zero is simply does not
>take these kinds of issues into consideration.
>

Well it was designed for a different system initially with
a different style of booting.

Jan Engelhardt

unread,

Feb 12, 2008, 6:43:29 PM2/12/08

to David Miller, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Feb 12 2008 15:38, David Miller wrote:
>
>> I still don't like the idea of btrfs trying to be smarter than a user
>> who can partition up his system according to
>> (a) his likes
>> (b) system or hardware requirements or recommendations
>> to align the superblock to a specific location.
>
>All of your beliefs are unfortunately without the understanding
>of restrictions that exist in several partition layouts such
>as the Sun disk label one.

(Time for Sun to switch!)

>You have to start the superblock somewhere other than zero or
>else you lose a huge chunk of your disk, and furthermore a
>zero based partition is what all of the Sun disk label
>creating programs make by default.
>
>"I make a default disk label, I put btrfs or XFS on there, my disk
> label is gone."
>
>Real intuitive.
>

x86 MSDOS partition table layout starts counting with sector 1, which
is (not so intuitively) starting at 0x7e00 (and there's no sector 0,
probably for safety). Well, each ptable format with its own quirks.

Theodore Tso

unread,

Feb 12, 2008, 7:48:12 PM2/12/08

to David Miller, jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tue, Feb 12, 2008 at 03:28:26PM -0800, David Miller wrote:
> From: Jan Engelhardt <jen...@computergmbh.de>
> Date: Tue, 12 Feb 2008 15:00:20 +0100 (CET)
>
> > Something looks wrong here. Why would btrfs need to zero at all?
>
> So that existing superblocks on the partition won't
> be interpreted as correct by other filesystems. It's
> a safety measure many mkfs programs use.
>
> > Superblock at 0, and done. Just like xfs.
>
> No, we won't do stupid things like that and make an entire
> cylinder of our disks unusable. See my other reply.

The reason why we don't put the superblock at 0 is not because it
screws over the sparc, but because on many systems (including x86) the
bootsector is stored at 0. It's not hard for mke2fs to zap the boot
sector which we do on all architectures *except* sparc, to avoid
nuking the disk label. (Chris just missed the "#ifndef __sparc //
#define ZAP_BOOTBLOCK // #endif" at the beginning of mke2fs.c)

This is the best of all words; it makes sparc happy; it allows boot
loaders to put the x86 standard initial stage 0 boot loader in the
first 446 bytes of the disk; and by zapping sector 0 on all
architectures except the sparc, it solves the previous filesystem
"ghost traces" detection problem for filesystems like xfs that put the
superblock at 0.

- Ted

David Miller

unread,

Feb 12, 2008, 8:07:59 PM2/12/08

to jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Jan Engelhardt <jen...@computergmbh.de>
Date: Wed, 13 Feb 2008 00:39:16 +0100 (CET)

> On the other hand, the H and S of CHS could be lowered and S increased,
> e.g. divide H by 2, divide S by 2, multiply S by 4. This gives a finer
> bytes/cylinder granularity.

That's really not an option when you only have 16-bits in the disk
label for some of these values, which is the case for Sun disk
labels and some others (f.e. BSD).

Jan, please get past this, it's a real issue and we therefore have to
skip the initial 1K or so in order to provide something even
approaching sane.

No modern filesystem should put it's superblock at position zero.

David Miller

unread,

Feb 12, 2008, 8:09:18 PM2/12/08

to jen...@computergmbh.de, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

From: Jan Engelhardt <jen...@computergmbh.de>
Date: Wed, 13 Feb 2008 00:42:56 +0100 (CET)

>
> On Feb 12 2008 15:38, David Miller wrote:
> >
> >> I still don't like the idea of btrfs trying to be smarter than a user
> >> who can partition up his system according to
> >> (a) his likes
> >> (b) system or hardware requirements or recommendations
> >> to align the superblock to a specific location.
> >
> >All of your beliefs are unfortunately without the understanding
> >of restrictions that exist in several partition layouts such
> >as the Sun disk label one.
>
> (Time for Sun to switch!)

You can go update every sparc machine's firmware then, you
can't boot off of a device that doesn't use a sun disk
label.

Your expectations are unrealistic, and it's just simpler to
do something sane and follow the majority of existing practice
by putting the superblock a few blocks into the partition.

Rene Herman

unread,

Feb 12, 2008, 8:21:36 PM2/12/08

to Jan Engelhardt, David Miller, chris...@oracle.com, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On 13-02-08 00:42, Jan Engelhardt wrote:

> x86 MSDOS partition table layout starts counting with sector 1, which is
> (not so intuitively) starting at 0x7e00 (and there's no sector 0,
> probably for safety). Well, each ptable format with its own quirks.

I haven't followed this thread, but in case it matters -- this sounds fairly
confused.

Not sure what you're saying, but the MSDOS partition table has its root
table in the very first sector on the disk, at offset 0x1be = 0x200 - 4 *
sizeof(struct partition) - 2 (that is, 4 entries at the end of that first
sector, followed by a 2-byte signature).

That 0x7e00 that you are speaking of sounds somewhat like the _memory_
address the BIOS loads that first sector to: 0x7c00. It then jumps there to
start the ball rolling but 0x7c00 is not an on-disk reality or anything.

MS-DOS partition tables are furthermore fully outside the actual partitions
themselves and as such I believe not all together relevant to the issue? (as
said, not following along though...)

Rene

Bryan Henderson

unread,

Feb 12, 2008, 8:26:47 PM2/12/08

to David Miller, btrfs...@oss.oracle.com, chris...@oracle.com, jen...@computergmbh.de, linux-...@vger.kernel.org, linux-...@vger.kernel.org

> The Sun disk label only allows you to specify the start of a partition
> in cylinders, so if you want to use a filesystem like XFS you have to
> start the partition on cylinder 1 which can be many blocks into the
> disk. That entire first cylinder is completely wasted.

I don't believe a cylinder of wasted disk space is significant. I don't
believe a cylinder of disk is worth adding the complexity of sharing a
partition to a filesystem. That complexity translates into engineering
time and mistakes.

Your other arguments for making a hole in the filesystem, based on
tradition, are more convincing.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems

Jeff Garzik

unread,

Feb 12, 2008, 9:10:47 PM2/12/08

to David Miller, chris...@oracle.com, jen...@computergmbh.de, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

David Miller wrote:
> From: Chris Mason <chris...@oracle.com>

> Date: Tue, 12 Feb 2008 09:08:59 -0500

>
>> I've had requests to move the super down to 64k to make room for
>> bootloaders, which may not matter for sparc, but I don't really plan
>> on different locations for different arches.
>

> The Sun disk label sits in the first 512 bytes and the boot loader
> block sits in the second 512 bytes.
>
> I think leaving even more space is a good idea for several reasons.

Yep. I chose 32K unused space in the prototype filesystem I wrote [1,
2.4 era]. I'm pretty sure I got that number from some other filesystem,
maybe even some NTFS incarnation. It's just good practice to avoid the
first and last "chunks" of a partition, FSVO chunk.

Jeff

[1] http://kernel.org/pub/linux/kernel/people/jgarzik/ibu/

Christoph Hellwig

unread,

Feb 13, 2008, 2:03:04 AM2/13/08

to David Miller, chris...@oracle.com, jen...@computergmbh.de, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tue, Feb 12, 2008 at 03:35:57PM -0800, David Miller wrote:
> What XFS does is really unfortunate, let's learn from it's
> mistake.

I'd rather say what Sun did with their disklabels was rather unfortunate :)
But yeah, new filesystem should cater for it's braindamage because it
doesn't have any kind of runtime cost at all. XFS was designed for
IRIX back then which had disklabels just like the SUN ones, just without
the braindamage of having the disklabel inside the partition..

Szabolcs Szakacsits

unread,

Feb 13, 2008, 7:51:38 PM2/13/08

to Jeff Garzik, David Miller, chris...@oracle.com, jen...@computergmbh.de, linux-...@vger.kernel.org, linux-...@vger.kernel.org, btrfs...@oss.oracle.com

On Tue, 12 Feb 2008, Jeff Garzik wrote:
>
> Yep. I chose 32K unused space in the prototype filesystem I wrote [1, 2.4
> era]. I'm pretty sure I got that number from some other filesystem, maybe
> even some NTFS incarnation.

NTFS superblock (and the partial mirror copy) can be anywhere except in the
first blocks. That space is where the $BOOT file is placed which contains
the bootstrap code and the BIOS Paramether Block which includes the NTFS
signature and describes various filesystem information needed to locate the
superblock, etc.

Unlike mkfs.xfs which warns since at least 2002 and requires the -f option
to override Sun disklabels, at the moment mkfs.ntfs will indeed destroy
them.

Thank you for the bug report and let's hope the next generation of Sun
hardwares won't scatter the firmware too into random places inside a
partition encoded by a fictitious size of disk cylinder.

Szaka