New filesystem for Linux


Mikulas Patocka

Nov 2, 2006, 4:53:14 PM
to linux-...@vger.kernel.org
Hi

As my PhD thesis, I am designing and writing a filesystem, and it's now in
a state where it can be released. You can download it from
http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/

It has some new features: it keeps inode information directly in the
directory (until you create a hardlink) so that ls -la doesn't seek much;
it uses a new method to keep data consistent in case of crashes (instead
of journaling); and free space is organized in lists of free runs,
converted to a bitmap only in case of extreme fragmentation.

It is not very widely tested, so if you want, test it.

I have these questions:

* There is a rw semaphore that is locked for read for nearly all
operations and locked for write only rarely. However, locking for read
causes cache line ping-pong on SMP systems. Do you have an idea how to
make it better?

It could be improved by making a semaphore for each CPU, locking only the
CPU's own semaphore for read and all semaphores for write (see the sketch
below). Or is there a better method?
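
(For illustration, a minimal kernel-style sketch of this per-CPU
"big-reader" scheme; the names are hypothetical and each semaphore would
need init_rwsem() at mount time. Readers return the slot they locked, so
migrating CPUs between lock and unlock is harmless; writers take every
slot in a fixed order to avoid deadlock.)

#include <linux/rwsem.h>
#include <linux/smp.h>
#include <linux/threads.h>

static struct rw_semaphore pcpu_sem[NR_CPUS];   /* init_rwsem() each at mount */

static int fs_lock_read(void)
{
        int cpu = raw_smp_processor_id();       /* only a sharding hint */
        down_read(&pcpu_sem[cpu]);
        return cpu;                     /* caller hands it to fs_unlock_read() */
}

static void fs_unlock_read(int cpu)
{
        up_read(&pcpu_sem[cpu]);        /* unlock the slot that was locked */
}

static void fs_lock_write(void)
{
        int i;
        for (i = 0; i < NR_CPUS; i++)   /* fixed order avoids deadlock */
                down_write(&pcpu_sem[i]);
}

static void fs_unlock_write(void)
{
        int i;
        for (i = 0; i < NR_CPUS; i++)
                up_write(&pcpu_sem[i]);
}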

* This leads to another observation --- on i386, locking a semaphore is 2
instructions; on x86_64 it is a call to two nested functions. Is there a
reason for that, or was it just implementer's laziness? Given that a
locked instruction takes 16 ticks on an Opteron (and can overlap about 2
ticks with other instructions), it would make sense to have optimized
semaphores too.

* How to implement ordered-data consistency? That would mean that on an
internal sync event, I'd have to flush all pages of files that were
extended. I could scan all dirty inodes and find pages to flush --- what
kernel function would you recommend for doing it? Currently I call only
sync_blockdev, which doesn't touch buffers attached to pages.
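
(One possibility, as a hedged sketch: filemap_fdatawrite() and
filemap_fdatawait() are the 2.6 APIs that start and wait on writeback of a
single file's dirty pages; the private list of extended inodes below is an
assumption for illustration, not existing SpadFS code.)

#include <linux/fs.h>
#include <linux/list.h>

/* assumed fs-private inode wrapper tracking files extended since last sync */
struct my_inode_info {
        struct inode vfs_inode;
        struct list_head extend_list;   /* linked on the fs's "extended" list */
};

static void flush_extended_files(struct list_head *extended)
{
        struct my_inode_info *mi;

        list_for_each_entry(mi, extended, extend_list) {
                struct address_space *m = mi->vfs_inode.i_mapping;
                filemap_fdatawrite(m);  /* start writeback of dirty pages */
                filemap_fdatawait(m);   /* wait for the writes to finish */
        }
}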

Mikulas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Gabriel C

Nov 2, 2006, 5:32:48 PM
to Mikulas Patocka
Mikulas Patocka wrote:
> Hi
>

Hi

> As my PhD thesis, I am designing and writing a filesystem, and it's now in
> a state that it can be released. You can download it from
> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>

Does not compile for me, using 2.6.18.1 and gcc 4.1.1. Here is the error:

/work/crazy/packages/fs/spadfs-0.9.0/super.c: In function 'SPADFS_GET_SB':
/work/crazy/packages/fs/spadfs-0.9.0/super.c:636: error: too few
arguments to function 'get_sb_bdev'
/work/crazy/packages/fs/spadfs-0.9.0/super.c: At top level:
/work/crazy/packages/fs/spadfs-0.9.0/super.c:645: warning:
initialization from incompatible pointer type
/work/crazy/packages/fs/spadfs-0.9.0/super.c:651: warning:
initialization from incompatible pointer type
/work/crazy/packages/fs/spadfs-0.9.0/super.c: In function 'SPADFS_GET_SB':
/work/crazy/packages/fs/spadfs-0.9.0/super.c:637: warning: control
reaches end of non-void function
make[2]: *** [/work/crazy/packages/fs/spadfs-0.9.0/super.o] Error 1
make[1]: *** [_module_/work/crazy/packages/fs/spadfs-0.9.0] Error 2
make[1]: Leaving directory `/usr/src/linux-2.6.18-fw2'
make: *** [spadfs] Error 2

Gabriel

Eric Dumazet

Nov 2, 2006, 5:53:19 PM
to Mikulas Patocka
Mikulas Patocka a écrit :

> Hi
>
> As my PhD thesis, I am designing and writing a filesystem, and it's now
> in a state that it can be released. You can download it from
> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>
> It has some new features, such as keeping inode information directly in
> directory (until you create hardlink) so that ls -la doesn't seek much,
> new method to keep data consistent in case of crashes (instead of
> journaling), free space is organized in lists of free runs and converted
> to bitmap only in case of extreme fragmentation.
>
> It is not very widely tested, so if you want, test it.
>
> I have these questions:
>
> * There is a rw semaphore that is locked for read for nearly all
> operations and locked for write only rarely. However locking for read
> causes cache line pingpong on SMP systems. Do you have an idea how to
> make it better?
>
> It could be improved by making a semaphore for each CPU and locking for
> read only the CPU's semaphore and for write all semaphores. Or is there
> a better method?
>

If you believe you need a semaphore for protecting a mostly-read structure,
then RCU is certainly a good candidate (i.e. no locked operations at all).

The problem with a per-CPU biglock is that you may consume a lot of RAM for
big NR_CPUS. Count 32 KB per 'biglock' if NR_CPUS=1024.
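
(For context, the classic 2.6 RCU pattern looks roughly like this --- a
generic sketch with made-up names, not SpadFS code. Note the read side may
not sleep, which is exactly the limitation raised later in the thread.)

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct cfg { int value; };
static struct cfg *cur_cfg;             /* readers reach this via RCU */
static DEFINE_SPINLOCK(cfg_lock);       /* serializes writers only */

static int cfg_read(void)
{
        int v;
        rcu_read_lock();                /* no locked instruction, no ping-pong */
        v = rcu_dereference(cur_cfg)->value;
        rcu_read_unlock();              /* must not sleep inside the section */
        return v;
}

static void cfg_update(int value)
{
        struct cfg *n = kmalloc(sizeof(*n), GFP_KERNEL);
        struct cfg *old;

        if (!n)
                return;
        n->value = value;
        spin_lock(&cfg_lock);
        old = cur_cfg;
        rcu_assign_pointer(cur_cfg, n); /* publish the new version */
        spin_unlock(&cfg_lock);
        synchronize_rcu();              /* wait until no reader can see 'old' */
        kfree(old);
}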

> * This leads to another observation --- on i386 locking a semaphore is 2
> instructions, on x86_64 it is a call to two nested functions. Has it
> some reason or was it just implementator's laziness? Given the fact that
> locked instruction takes 16 ticks on Opteron (and can overlap about 2
> ticks with other instructions), it would make sense to have optimized
> semaphores too.

Hmm, please don't use *lazy*, this could make Andi unhappy :)

What are you calling a semaphore, exactly?
Did you read Documentation/mutex-design.txt?

Grzegorz Kulewski

Nov 2, 2006, 5:54:44 PM
to Mikulas Patocka
Hi,

On Thu, 2 Nov 2006, Mikulas Patocka wrote:
> As my PhD thesis, I am designing and writing a filesystem, and it's now in a
> state that it can be released. You can download it from
> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/

"Disk that can atomically write one sector (512 bytes) so that the sector
contains either old or new content in case of crash."

Well, maybe I am completely wrong, but as far as I understand, no disk
currently provides such a guarantee. Disks can have (after a halted
write):
- old data,
- new data,
- nothing (an unreadable sector --- the result of an incomplete write and a
disk-internal checksum failure for that sector; this happens especially
often if you have frequent power outages).

And possibly some broken drives may also return you something that they
think is good data but really is not (this shouldn't happen, since both
disks and cables should be protected by checksums, but hey... you can never
be absolutely sure, especially on very big storage).

So... isn't this making your filesystem a little flawed in design?


Thanks,

Grzegorz Kulewski

Eric Dumazet

Nov 2, 2006, 6:12:56 PM
to Grzegorz Kulewski
Grzegorz Kulewski a écrit :

> Hi,
>
> On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>> As my PhD thesis, I am designing and writing a filesystem, and it's
>> now in a state that it can be released. You can download it from
>> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>
> "Disk that can atomically write one sector (512 bytes) so that the sector
> contains either old or new content in case of crash."
>
> Well, maybe I am completly wrong but as far as I understand no disk
> currently will provide such requirement. Disks can have (after halted
> write):
> - old data,
> - new data,
> - nothing (unreadable sector - result of not full write and disk
> internal checksum failute for that sector, happens especially often if
> you have frequent power outages).
>

I believe some vendors have such devices. Mikulas called them 'disk', but
it's in fact a (disk(s), controller, RAM, battery) combination.

Some controllers are even able to write the unwritten data into flash
memory when/if the battery/power is about to fail. When power comes back,
the controller can finally do the writes to the disks.


> And possibly some broken drives may also return you something that they
> think is good data but really is not (shouldn't happen since both disks
> and cables should be protected by checksums, but hey... you can never be
> absolutely sure especially on very big storages).
>
> So... isn't this making your filesystem a little flawed in design?

Well... even RAM can fail :) In that case, isn't Linux flawed in design?

Linus Torvalds

Nov 2, 2006, 6:16:16 PM
to Mikulas Patocka

On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>

> * There is a rw semaphore that is locked for read for nearly all operations
> and locked for write only rarely. However locking for read causes cache line
> pingpong on SMP systems. Do you have an idea how to make it better?

In a filesystem, I doubt you should ever really notice, since it should
hopefully be called mainly for operations that need IO, not for cached
stuff. The one exception tends to be "readdir()", which calls down to the
filesystem even though the directory is cached, because the VFS cannot
parse the directory structure on its own.

So in fact, a regular lock can be _better_ than a rw-lock for precisely
the thing you noticed: the lock itself will still ping-pong, but you may
find that other data structures end up better behaved.

That said, if you *really* want to make reads go fast, and you can
share all the data structures well across CPU's (ie you don't do things
that dirty cachelines), then RCU is likely your best bet. It limits you in
certain ways to exactly what you can do, but it certainly allows unlimited
read-side scalability.

Also, you might look into <linux/seqlock.h> if your data structures are
stable enough to be used in that kind of scenario.

Both seqlocks and RCU essentially require that you are atomic in the
sequence (ie you can't do IO). They also require that no writers will ever
cause a situation where the data structures can cause _problems_ for
an unsynchronized reader - the reader will go ahead as if the writer
didn't exist at all (which is what allows things to be fast).

(Seqlocks could be changed to drop the first requirement, although it
could cause some serious starvation issues, so I'm not sure it's a good
idea. For RCU the atomic nature is pretty much designed-in.)
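
(A minimal example of the <linux/seqlock.h> pattern, with hypothetical
names: readers retry instead of blocking, so they must not sleep or do IO
inside the loop.)

#include <linux/seqlock.h>
#include <linux/types.h>

static seqlock_t stats_lock = SEQLOCK_UNLOCKED; /* 2.6-era static initializer */
static u64 stat_a, stat_b;

static u64 stats_read(void)
{
        unsigned seq;
        u64 sum;

        do {
                seq = read_seqbegin(&stats_lock);
                sum = stat_a + stat_b;  /* no sleeping, no IO in here */
        } while (read_seqretry(&stats_lock, seq)); /* retry on writer overlap */
        return sum;
}

static void stats_write(u64 a, u64 b)
{
        write_seqlock(&stats_lock);
        stat_a = a;
        stat_b = b;
        write_sequnlock(&stats_lock);
}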

> It could be improved by making a semaphore for each CPU and locking for read
> only the CPU's semaphore and for write all semaphores. Or is there a better
> method?

No. Don't go down there. It's a total mess, and really not worth it. The
deadlocks, the write starvation, everything. If you can't make things work
with either RCU or seqlocks, just use a regular semaphore.

> * This leads to another observation --- on i386 locking a semaphore is 2
> instructions, on x86_64 it is a call to two nested functions. Has it some
> reason or was it just implementator's laziness? Given the fact that locked
> instruction takes 16 ticks on Opteron (and can overlap about 2 ticks with
> other instructions), it would make sense to have optimized semaphores too.

It's generally not worth worrying about. We started doing spinlocks out of
line, essentially because the instruction overhead was less than the I$
overhead (it also allowed us to do the code sharing with debugging locks
more easily).

You do want to avoid locking overhead (so if you're nesting very deep, you
have problems, but your problems are worse than the few extra function
calls). I doubt you'll hit it in practice - see the initial comment on why
things like the FS really should be totally invisible anyway. Apart from
things like readdir() (and possibly "revalidate()", but in a local
filesystem that shouldn't be an issue), you should worry a lot more about
the IO costs than about the CPU costs.

> * How to implement ordered-data consistency? That would mean that on internal
> sync event, I'd have to flush all pages of a files that were extended. I could
> scan all dirty inodes and find pages to flush --- what kernel function would
> you recommend for doing it? Currently I call only sync_blockdev which doesn't
> touch buffers attached to pages.

You're not going to get data consistency with memory-mapped usage anyway,
without the app helping you with msync() etc.

Linus

Mikulas Patocka

Nov 2, 2006, 6:19:34 PM
to Grzegorz Kulewski
> Hi,
>
> On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>> As my PhD thesis, I am designing and writing a filesystem, and it's now in
>> a state that it can be released. You can download it from
>> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>
> "Disk that can atomically write one sector (512 bytes) so that the sector
> contains either old or new content in case of crash."
>
> Well, maybe I am completly wrong but as far as I understand no disk currently
> will provide such requirement. Disks can have (after halted write):
> - old data,
> - new data,
> - nothing (unreadable sector - result of not full write and disk internal
> checksum failute for that sector, happens especially often if you have
> frequent power outages).
>
> And possibly some broken drives may also return you something that they think
> is good data but really is not (shouldn't happen since both disks and cables
> should be protected by checksums, but hey... you can never be absolutely sure
> especially on very big storages).
>
> So... isn't this making your filesystem a little flawed in design?

There was a discussion about it here some time ago, and I think the result
was that the IDE bus is reset prior to the capacitors discharging and total
loss of power, and the disk has enough time to finish a sector --- but if
you have a crap power supply (doesn't signal power loss), a crap
motherboard (doesn't reset the bus) or a crap disk (doesn't respond to
reset), it can fail.

BTW. reiserfs and xfs depend on this feature too. ext3 is the only one
that doesn't.

Mikulas

Grzegorz Kulewski

Nov 2, 2006, 6:30:03 PM
to Mikulas Patocka

Hmm, maybe. But I think I have seen a couple of such bad sectors in the
wild that were only bad because of power loss.


> BTW. reiserfs and xfs depend on this feature too. ext3 is the only one that
> doesn't.

Well, at least for XFS, everybody says that it should be used only with a
UPS if you really care about your data. I think it has something to do with
the heavy in-RAM caching this filesystem does.

Anyway, it looks strange to list something very fragile and potentially
nonexistent in the requirements... :-)

Could you explain where exactly you depend on this requirement? And
what could happen if it is not true?


Thanks,

Grzegorz Kulewski


PS. Do you have any benchmarks of your filesystem? Did you do any longer
automated tests to prove it is not going to lose data too easily?

Andi Kleen

Nov 2, 2006, 6:42:13 PM
to Mikulas Patocka
Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz> writes:

> new method to keep data consistent in case of crashes (instead
> of journaling),

What is that method?

> * There is a rw semaphore that is locked for read for nearly all

Depending on the length of the critical section, rw locks are often
not faster than non-rw locks, because the read case has to bounce
around the cache line of the lock anyway, and they're actually
a little more expensive.


> * This leads to another observation --- on i386 locking a semaphore is
> 2 instructions, on x86_64 it is a call to two nested functions. Has it

The second call should be a tail call, i.e. just a jump.

The first call isn't needed on a non-debug kernel, but doing the
two unconditional jumps should be basically free on a modern OOO CPU.

The actual implementation is spinlock-based, vs. atomic-based for i386.
This was because at some point nobody could benchmark a difference
between the two, and the spinlock-based version is much easier
to verify and to understand. If you can demonstrate a difference,
that could be reevaluated.

> some reason or was it just implementator's laziness? Given the fact
> that locked instruction takes 16 ticks on Opteron (and can overlap
> about 2 ticks with other instructions), it would make sense to have
> optimized semaphores too.

Lately we're going more for saved icache and better
use of branch predictors (which are happier with fewer branch locations
to cache). I would expect the call/ret to be executed
mostly in parallel with the other code anyway.

-Andi

Mikulas Patocka

Nov 2, 2006, 8:19:39 PM
to Jörn Engel
> On Thu, 2 November 2006 22:52:47 +0100, Mikulas Patocka wrote:
>>
>> new method to keep data consistent in case of crashes (instead of
>> journaling),
>
> Your 32-bit transaction counter will overflow in the real world. It
> will take a setup with millions of transactions per second and even
> then not trigger for a few years, but when it hits your filesystem,
> the administrator of such a beast won't be happy at all. :)
>
> Jörn

If it overflows, it increases the crash count instead. So really you have
2^47 transactions: 65536 crashes with 2^31 transactions between each crash
(see the sketch below).
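
(A sketch of the overflow rule as described --- the field widths follow
the 65536 x 2^31 arithmetic above; the names are hypothetical, not
SpadFS's actual code. The top txc bit is shown reserved because the thread
later uses it as a validity flag.)

/* hypothetical counters: 31 usable txc bits, 16-bit crash count */
#define TXC_MAX 0x7fffffffu     /* top bit reserved as a validity flag */

static unsigned int txc;        /* transactions within the current crash epoch */
static unsigned short cc;       /* crash count */

static void open_transaction(void)
{
        if (txc == TXC_MAX) {   /* txc overflow consumes one crash slot */
                cc++;           /* filesystem is end-of-life when cc wraps */
                txc = 0;
        } else {
                txc++;
        }
}
/* total capacity: 2^16 * 2^31 = 2^47 transactions */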

Mikulas

Mikulas Patocka

Nov 2, 2006, 8:23:32 PM
to Gabriel C

On Thu, 2 Nov 2006, Gabriel C wrote:

> Mikulas Patocka wrote:
>> Hi
>>
>
> Hi
>
>> As my PhD thesis, I am designing and writing a filesystem, and it's now in
>> a state that it can be released. You can download it from
>> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>>
>
> Does not compile for me , using 2.6.18.1 , gcc 4.1.1. Here the error :
>
> /work/crazy/packages/fs/spadfs-0.9.0/super.c: In function 'SPADFS_GET_SB':
> /work/crazy/packages/fs/spadfs-0.9.0/super.c:636: error: too few
> arguments to function 'get_sb_bdev'
> /work/crazy/packages/fs/spadfs-0.9.0/super.c: At top level:
> /work/crazy/packages/fs/spadfs-0.9.0/super.c:645: warning:
> initialization from incompatible pointer type
> /work/crazy/packages/fs/spadfs-0.9.0/super.c:651: warning:
> initialization from incompatible pointer type
> /work/crazy/packages/fs/spadfs-0.9.0/super.c: In function 'SPADFS_GET_SB':
> /work/crazy/packages/fs/spadfs-0.9.0/super.c:637: warning: control
> reaches end of non-void function
> make[2]: *** [/work/crazy/packages/fs/spadfs-0.9.0/super.o] Error 1
> make[1]: *** [_module_/work/crazy/packages/fs/spadfs-0.9.0] Error 2
> make[1]: Leaving directory `/usr/src/linux-2.6.18-fw2'
> make: *** [spadfs] Error 2

Hmm, I see, they changed some stuff ... and in 2.6.19 too. I made a new
version that compiles with 2.6.18 and 2.6.19rc4, so try it.

BTW. I've found weird code in 2.6.19rc4 in vfs_getattr:
generic_fillattr(inode, stat); (which ends with stat->blksize = (1 <<
inode->i_blkbits);)
and then
if (!stat->blksize) {...
--- but (1 << anything) can never evaluate to zero.

Someone introduced this bug when changing it.

Mikulas

Mikulas Patocka

Nov 2, 2006, 8:28:37 PM
to Eric Dumazet
>> I have these questions:
>>
>> * There is a rw semaphore that is locked for read for nearly all operations
>> and locked for write only rarely. However locking for read causes cache
>> line pingpong on SMP systems. Do you have an idea how to make it better?
>>
>> It could be improved by making a semaphore for each CPU and locking for
>> read only the CPU's semaphore and for write all semaphores. Or is there a
>> better method?
>>
>
> If you believe you need a semaphore for protecting a mostly read structure,
> then RCU is certainly a good candidate. (ie no locked operation at all)

RCU is for non-blocking operation only.

Maybe it could be used for maintaining an unlocked variable that prevents
readers from entering a critical section --- but waiting in a write for a
quiescent state would hurt.

> The problem with a per_cpu biglock is that you may consume a lot of RAM for
> big NR_CPUS. Count 32 KB per 'biglock' if NR_CPUS=1024
>
>> * This leads to another observation --- on i386 locking a semaphore is 2
>> instructions, on x86_64 it is a call to two nested functions. Has it some
>> reason or was it just implementator's laziness? Given the fact that locked
>> instruction takes 16 ticks on Opteron (and can overlap about 2 ticks with
>> other instructions), it would make sense to have optimized semaphores too.
>
> Hum, please dont use *lazy*, this could make Andi unhappy :)
>
> What are you calling semaphore exactly ?
> Did you read Documentation/mutex-design.txt ?

I see --- I could replace that semaphore with a mutex. But rw-semaphores
can't be replaced with them.

Mikulas

Mikulas Patocka

Nov 2, 2006, 8:34:55 PM
to Grzegorz Kulewski
>> There was discussion about it here some times ago, and I think the result
>> was that the IDE bus is reset prior to capacitors discharge and total loss
>> of power and disk has enough time to finish a sector --- but if you have
>> crap power supply (doesn't signal power loss), crap motherboard (doesn't
>> reset bus) or crap disk (doesn't respond to reset), it can fail.
>
> Hmm, maybe. But I think I saw couple of such bad sectors that were only bad
> because of power loss in the wild.
>
>
>> BTW. reiserfs and xfs depend on this feature too. ext3 is the only one that
>> doesn't.
>
> Well, at least for XFS everybody tell that it should be used with UPS only if
> you really care about your data. I think it has something to do with heavy
> in-RAM caching this filesystem does.

The system is allowed to cache anything unless sync/fsync is called.
Someone said that XFS has some bugs where, if it crashes the wrong way, it
can lose already-synced data ... I don't know. Plus it has that infamous
feature (not a bug) where it commits the size increase but not the data,
and you see zero-filled files.

> Anyway, it looks strange to list something very fragile and potentially not
> existing in the requirements... :-)

Better to list it than to quietly depend on it like ext2/fat/reiser/xfs/
(maybe jfs?) do.

> Could you explain where exactly do you depend on this requirement? And what
> could happen if it is not true?

If you write a file in a directory and the sector is unwritable upon write
& crash, you lose those few files near it --- in just the same way you
would lose 4 files in the inode table on ext2 in this case.

> Thanks,
>
> Grzegorz Kulewski
>
>
> PS. Do you have any benchmarks of your filesystem? Did you do any longer
> automated tests to prove it is not going to loose data to easily?

I have; I may find them and post them (but the university wants me to
submit them to some conference, so I should keep them secret :-/).

Mikulas

Andrew Morton

Nov 2, 2006, 8:42:19 PM
to Mikulas Patocka
On Fri, 3 Nov 2006 02:22:57 +0100 (CET)
Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz> wrote:

> BTW. I've found a weird code in 2.6.19rc4 in vfs_getattr:
> generic_fillattr(inode, stat); (ends with stat->blksize = (1 <<
> inode->i_blkbits);)
> and then
> if (!stat->blksize) {...

Good point. I queued a patch to kill it.

> Someone made this bug when changing it.

Well, not really. I very much doubt if we ever had any inodes with a zero
in ->i_blksize. I suspect that code (which has been like that since at
least 2.6.12) just never did anything.

From: Andrew Morton <ak...@osdl.org>

As Mikulas points out, (1 << anything) won't be evaluating to zero. This code
is long-dead.

Cc: Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz>
Signed-off-by: Andrew Morton <ak...@osdl.org>
---

fs/stat.c | 7 -------
1 files changed, 7 deletions(-)

diff -puN fs/stat.c~vfs_getattr-remove-dead-code fs/stat.c
--- a/fs/stat.c~vfs_getattr-remove-dead-code
+++ a/fs/stat.c
@@ -51,13 +51,6 @@ int vfs_getattr(struct vfsmount *mnt, st
return inode->i_op->getattr(mnt, dentry, stat);

generic_fillattr(inode, stat);
- if (!stat->blksize) {
- struct super_block *s = inode->i_sb;
- unsigned blocks;
- blocks = (stat->size+s->s_blocksize-1) >> s->s_blocksize_bits;
- stat->blocks = (s->s_blocksize / 512) * blocks;
- stat->blksize = s->s_blocksize;
- }
return 0;
}

_

Andrew Morton

Nov 2, 2006, 8:43:49 PM
to Mikulas Patocka
On Fri, 3 Nov 2006 02:28:17 +0100 (CET)
Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz> wrote:

> RCU is for non-blocking operation only.

We have SRCU now, which allegedly permits sleeping inside the critical section.

Mikulas Patocka

Nov 2, 2006, 8:45:43 PM
to Andi Kleen
> Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz> writes:
>
>> new method to keep data consistent in case of crashes (instead
>> of journaling),
>
> What is that method?

Some tricks to avoid a journal --- see
http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/INTERNALS

--- unlike journaling, it survives only 65536 crashes :)

>> * There is a rw semaphore that is locked for read for nearly all
>
> Depending on the length of the critical section rw locks are often
> not faster than non rw locks because the read case has to bounce
> around the cache line of the lock anyways and they're actually
> a little more expensive.

This critical section is long --- i.e. any reads/writes to disk. Making it
a simple semaphore would effectively serialize all operations.

>> * This leads to another observation --- on i386 locking a semaphore is
>> 2 instructions, on x86_64 it is a call to two nested functions. Has it
>
> The second call should be a tail call, i.e. just a jump

It is down_write -> (tailcall) down_write_nested -> (normal call)
spin_lock_irq and spin_unlock_irq.

> The first call isn't needed on a non debug kernel, but doing the
> two unconditional jumps should be basically free on a modern OOO CPU.

But it kills one cacheline.

> The actual implementation is spinlock based vs atomic based for i386.
> This was because at some point nobody could benchmark a difference
> between the two and the spinlock based version is much easier
> to verify and to understand. If you can demonstrate a difference
> that could be reevaluated.

Maybe one day I'll try it.

>> some reason or was it just implementator's laziness? Given the fact
>> that locked instruction takes 16 ticks on Opteron (and can overlap
>> about 2 ticks with other instructions), it would make sense to have
>> optimized semaphores too.
>
> In the last time we're going more for saved icache and better
> use of branch predictors (who are more happy with less branch locations
> to cache) I would expect the call/ret to be executed
> mostly in parallel with the other code anyways.

I see, but the pushf, cli and popf in that spinlock hurt too (especially
on Intel, which has them completely microcoded --- a pushf/popf pair is
100 ticks on an Intel P4E and 12 ticks on an Opteron).

Mikulas

Gabriel C

Nov 2, 2006, 9:09:45 PM
to Mikulas Patocka

That error looks fixed; now I have a new one here :)

cc -D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall
-fdollars-in-identifiers -O2 -fomit-frame-pointer -c -o MKSPADFS.o -x c
MKSPADFS.C
MKSPADFS.C:146: error: expected declaration specifiers or '...' before
'_llseek'
MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'fd'
MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'hi'
MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'lo'
MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'res'
MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'wh'
MKSPADFS.C:146: warning: type defaults to 'int' in declaration of
'_syscall5'
In file included from MKSPADFS.C:153:
GETHSIZE.I: In function 'test_access':
GETHSIZE.I:13: warning: implicit declaration of function '_llseek'
make: *** [MKSPADFS.o] Error 1


> BTW. I've found a weird code in 2.6.19rc4 in vfs_getattr:
> generic_fillattr(inode, stat); (ends with stat->blksize = (1 <<
> inode->i_blkbits);)
> and then
> if (!stat->blksize) {...
>
> Someone made this bug when changing it.
>
> Mikulas
>
>

Gabriel

Jan Engelhardt

Nov 3, 2006, 3:27:44 AM
to Gabriel C
>>>> As my PhD thesis, I am designing and writing a filesystem, and it's now in
>>>> a state that it can be released. You can download it from
>>>> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>>>>
>> Hmm, I see, they changed some stuff ... and in 2.6.19 too. I made a new
>> version that compiles with 2.6.18 and 2.6.19rc4, so try it.
>
>This error looks fixed, now I have a new one here :)
>
>cc -D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall
>-fdollars-in-identifiers -O2 -fomit-frame-pointer -c -o MKSPADFS.o -x c
>MKSPADFS.C
>MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>'_llseek'
>MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'fd'
>MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'hi'
>MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'lo'
>MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'res'
>MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'wh'
>MKSPADFS.C:146: warning: type defaults to 'int' in declaration of
>'_syscall5'

Ugh this syscall 'crap' is butt-ugly.

So anyway, why do you need _llseek? Can't you just use lseek() like
everyone else?

>In file included from MKSPADFS.C:153:
>GETHSIZE.I: In function 'test_access':
>GETHSIZE.I:13: warning: implicit declaration of function '_llseek'
>make: *** [MKSPADFS.o] Error 1

And why are the filenames all uppercase? I thought I left the DOS days
behind some years ago. Your makefile also has some pitfalls. Look at
unionfs for how to organize kernel and userspace files in a better way in
one tree.


spadfs-01.diff
Index: spadfs-0.9.1/Kbuild
===================================================================
--- /dev/null
+++ spadfs-0.9.1/Kbuild
@@ -0,0 +1,2 @@
+obj-m := spadfs.o
+spadfs-y := alloc.o buffer.o dir.o file.o inode.o link.o name.o namei.o super.o
Index: spadfs-0.9.1/Makefile
===================================================================
--- spadfs-0.9.1.orig/Makefile
+++ spadfs-0.9.1/Makefile
@@ -1,6 +1,4 @@
ifneq ($(KERNELRELEASE),)
-obj-m := spadfs.o
-spadfs-y := alloc.o buffer.o dir.o file.o inode.o link.o name.o namei.o super.o
else
KERNELDIR := /usr/src/linux-`uname -r`/
all : spadfs mkspadfs spadfsck
#<EOF>


spadfs-02.diff
Index: spadfs-0.9.1/Kbuild
===================================================================
--- spadfs-0.9.1.orig/Kbuild
+++ spadfs-0.9.1/Kbuild
@@ -1,2 +1,2 @@
-obj-m := spadfs.o
+obj-m += spadfs.o
spadfs-y := alloc.o buffer.o dir.o file.o inode.o link.o name.o namei.o super.o
Index: spadfs-0.9.1/Makefile
===================================================================
--- spadfs-0.9.1.orig/Makefile
+++ spadfs-0.9.1/Makefile
@@ -1,19 +1,24 @@
-ifneq ($(KERNELRELEASE),)
-else
-KERNELDIR := /usr/src/linux-`uname -r`/
+
+KERNELDIR := /lib/modules/$(shell uname -r)/build
+
all : spadfs mkspadfs spadfsck
+
spadfs :
$(MAKE) -C $(KERNELDIR) M=`pwd`
-CFLAGS=-D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall -fdollars-in-identifiers -O2 -fomit-frame-pointer
+
+CFLAGS := -D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall -fdollars-in-identifiers -O2 -fomit-frame-pointer
#CC=icc
#CFLAGS=-D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE
+
%.o : %.C
$(CC) $(CFLAGS) -c -o $@ -x c $<
+
mkspadfs : MKSPADFS.o SFSAPAGE.o
$(CC) $(CFLAGS) -o $@ $^ && strip $@
+
spadfsck : SPAD-API.o FSCK/SCK.o FSCK/SCKALLOC.o FSCK/SCKAPAGE.o FSCK/SCKBUF.o FSCK/SCKCCT.o FSCK/SCKCRT.o FSCK/SCKDIR.o FSCK/SCKEA.o FSCK/SCKFBLK.o FSCK/SCKFILE.o FSCK/SCKFN.o FSCK/SCKFXFN.o FSCK/SCKHDLNK.o FSCK/SCKLOG.o FSCK/SCKRCV.o FSCK/SCKSUPER.o FSCK/SCKXL.o
$(CC) $(CFLAGS) -o $@ $^ && strip $@
+
clean :
- rm -f *.o *.ko mkspadfs FSCK/*.o spadfsck .*.cmd spadfs.mod.c Modules.symvers
- rm -rf .tmp_versions
-endif
+ ${MAKE} -C ${KERNELDIR} M=$$PWD clean
+ rm -f *.o mkspadfs FSCK/*.o spadfsck Modules.symvers
#<EOF>


-`J'
--

Jörn Engel

Nov 3, 2006, 5:19:27 AM
to Mikulas Patocka
On Fri, 3 November 2006 02:19:05 +0100, Mikulas Patocka wrote:
> >On Thu, 2 November 2006 22:52:47 +0100, Mikulas Patocka wrote:
> >>
> >>new method to keep data consistent in case of crashes (instead of
> >>journaling),
> >
> >Your 32-bit transaction counter will overflow in the real world. It
> >will take a setup with millions of transactions per second and even
> >then not trigger for a few years, but when it hits your filesystem,
> >the administrator of such a beast won't be happy at all. :)
>
> If it overflows, it increases crash count instead. So really you have 2^47
> transactions or 65536 crashes and 2^31 transactions between each crash.

I am fully aware the counters are effectively 48-bit. If they were
just 32-bit, you would likely have hit the problem yourself already.

Jörn

--
Data expands to fill the space available for storage.
-- Parkinson's Law

Mikulas Patocka

Nov 3, 2006, 6:48:47 AM
to Gabriel C
> This error looks fixed, now I have a new one here :)
>
> cc -D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall
> -fdollars-in-identifiers -O2 -fomit-frame-pointer -c -o MKSPADFS.o -x c
> MKSPADFS.C
> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
> '_llseek'
> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'fd'
> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'hi'
> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'lo'
> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'res'
> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'wh'
> MKSPADFS.C:146: warning: type defaults to 'int' in declaration of
> '_syscall5'
> In file included from MKSPADFS.C:153:
> GETHSIZE.I: In function 'test_access':
> GETHSIZE.I:13: warning: implicit declaration of function '_llseek'
> make: *** [MKSPADFS.o] Error 1

Please send me the output of the C preprocessor (i.e. gcc -E).

Mikulas

Mikulas Patocka

Nov 3, 2006, 6:52:52 AM
to Jan Engelhardt
>> This error looks fixed, now I have a new one here :)
>>
>> cc -D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall
>> -fdollars-in-identifiers -O2 -fomit-frame-pointer -c -o MKSPADFS.o -x c
>> MKSPADFS.C
>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>> '_llseek'
>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'fd'
>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'hi'
>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'lo'
>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'res'
>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before 'wh'
>> MKSPADFS.C:146: warning: type defaults to 'int' in declaration of
>> '_syscall5'
>
> Ugh this syscall 'crap' is butt-ugly.

Yes, it is.

> So anyway, why do you need _llseek? Can't you just use lseek() like
> everyone else?

Because I want it to work with glibc 2.0 that I still use on one machine.

>> In file included from MKSPADFS.C:153:
>> GETHSIZE.I: In function 'test_access':
>> GETHSIZE.I:13: warning: implicit declaration of function '_llseek'
>> make: *** [MKSPADFS.o] Error 1
>
> And why are the filenames all uppercase? I think I left the DOS times
> some years ago. Your makefile also has some pits. Look at unionfs how to
> organize kernel and userspace files in a better way in one tree.

Because they are shared with another project and it is just practical to
copy them without renaming. Yes --- it looks inconsistent on the Linux side.

Mikulas

Mikulas Patocka

Nov 3, 2006, 6:56:59 AM
to Jörn Engel
>>>> new method to keep data consistent in case of crashes (instead of
>>>> journaling),
>>>
>>> Your 32-bit transaction counter will overflow in the real world. It
>>> will take a setup with millions of transactions per second and even
>>> then not trigger for a few years, but when it hits your filesystem,
>>> the administrator of such a beast won't be happy at all. :)
>>
>> If it overflows, it increases crash count instead. So really you have 2^47
>> transactions or 65536 crashes and 2^31 transactions between each crash.
>
> I am fully aware the counters are effectively 48-bit. If they were
> just 32-bit, you would likely have hit the problem yourself already.
>
> Jörn

Given a seek time of 0.01 s, a 31-bit value would last a minimum of 248
days when doing only syncs and nothing else (2^31 * 0.01 s is about 2.1e7
seconds, i.e. 248 days). A 47-bit value will last reasonably long.

Mikulas

Mikulas Patocka

Nov 3, 2006, 7:00:03 AM
to Jan Engelhardt

On Fri, 3 Nov 2006, Mikulas Patocka wrote:

>>> This error looks fixed, now I have a new one here :)
>>>
>>> cc -D__NOT_FROM_SPAD -D__NOT_FROM_SPAD_TREE -Wall
>>> -fdollars-in-identifiers -O2 -fomit-frame-pointer -c -o MKSPADFS.o -x c
>>> MKSPADFS.C
>>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>>> '_llseek'
>>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>>> 'fd'
>>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>>> 'hi'
>>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>>> 'lo'
>>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>>> 'res'
>>> MKSPADFS.C:146: error: expected declaration specifiers or '...' before
>>> 'wh'
>>> MKSPADFS.C:146: warning: type defaults to 'int' in declaration of
>>> '_syscall5'
>>
>> Ugh this syscall 'crap' is butt-ugly.
>
> Yes, it is.
>
>> So anyway, why do you need _llseek? Can't you just use lseek() like
>> everyone else?
>
> Because I want it to work with glibc 2.0 that I still use on one machine.

BTW. is it some interaction with symbols defined elsewhere or were
_syscall macros dropped altogether? Which glibc symbol should I use in
#ifdef to tell if glibc has 64-bit support?

Mikulas

Ric Wheeler

Nov 3, 2006, 8:07:01 AM
to Mikulas Patocka

Mikulas Patocka wrote:

These are two examples of very different classes of storage devices - if
you use a high end array (like EMC Clariion/Symm, IBM Shark, Hitachi,
NetApp Block, etc) once the target device acknowledges the write
transaction, you have a hard promise that the data is going to persist
after a power outage, etc.

If you are using a commodity disk, then you really have to worry about
how the drive's write cache will handle your IO. These disks will ack
the write once they have stored the write request in their volatile
memory, which can be lost on power outage.

That is a reasonable setting for most end users (high performance, few
power outages and some risk of data loss), but when data integrity is a
hard requirement, people typically run with the write cache disabled.

The "write barrier" support that is in reiserfs, ext3 and xfs all
provide something that is somewhere in the middle - good performance and
cache flushes injected on transaction commits or application level
fsync() commands.

I would not depend on the IDE bus reset or draining capacitors to safely
destage data - in fact, I know that it will routinely fail when we test
the write barrier on/off over power outages.

Modern S-ATA/ATA drives have 16MB or more of write cache, and there is a
lot of data to destage in those last few ms ;-)

Mikulas Patocka

Nov 3, 2006, 8:32:25 AM
to Jörn Engel
>>> I am fully aware the counters are effectively 48-bit. If they were
>>> just 32-bit, you would likely have hit the problem yourself already.
>>
>> Given the seek time 0.01s, 31-bit value would last for minimum time of 248
>> days when doing only syncs and nothing else. 47-bit value will last for
>> reasonably long.
>
> So you can at most do one transaction per drive seek? That would
> definitely solve the overflow case, but hardly sounds like a
> high-performance filesystem. :)

Really it can batch any number of modifications into one transaction
(unless fsync or sync is called). A transaction is closed only on
fsync/sync, when 2 minutes pass (the interval can be adjusted) or when the
disk runs out of space.

Mikulas

Jörn Engel

Nov 3, 2006, 8:48:36 AM
to Mikulas Patocka
On Fri, 3 November 2006 14:31:46 +0100, Mikulas Patocka wrote:
>
> Really it can batch any number of modifications into one transaction
> (unless fsync or sync is called). Transaction is closed only on
> fsync/sync, if 2 minutes pass (can be adjusted) or when the disk runs out
> of space.

Interesting. Let me pick an example and see where we're going from
there. You have four directories, A, B, C and D, none of which is the
parent of another. Two cross-directory renames happen:
$ mv A/foo B/
$ mv C/bar D/

This will cause four modifications, one to each of the directories. I
would have assumed that the modifications to A and B receive one
transaction number n, C and D get a different one, n+1 if nothing else
is going on in between.

To commit the first rename, n is written into cct[entry->cc]. To
commit both, n+1 is written instead. Committing the second
transaction without committing the first is not possible.

Now clearly we are disagreeing, so I must have misunderstood your
design somehow. Can you see how?

Jörn

--
The story so far:
In the beginning the Universe was created. This has made a lot
of people very angry and been widely regarded as a bad move.
-- Douglas Adams

Mikulas Patocka

Nov 3, 2006, 9:19:57 AM
to Jörn Engel
>> Really it can batch any number of modifications into one transaction
>> (unless fsync or sync is called). Transaction is closed only on
>> fsync/sync, if 2 minutes pass (can be adjusted) or when the disk runs out
>> of space.
>
> Interesting. Let me pick an example and see where we're going from
> there. You have four directories, A, B, C and D, none of which is the
> parent of another. Two cross-directory renames happen:
> $ mv A/foo B/
> $ mv C/bar D/
>
> This will cause four modifications, one to each of the directories. I
> would have assumed that the modifications to A and B receive one
> transaction number n, C and D get a different one, n+1 if nothing else
> is going on in between.

They most likely receive the same transaction (this is not like a
journaling transaction --- new transactions are opened only under the
conditions above).

A/foo entry is set with txc=memory_cct[memory_cc],cc=memory_cc
B/foo entry is set with txc=memory_cct[memory_cc]|0x80000000,cc=memory_cc
C/bar entry is set with txc=memory_cct[memory_cc],cc=memory_cc
D/bar entry is set with txc=memory_cct[memory_cc]|0x80000000,cc=memory_cc

They may be written in any order (that's some improvement over
journaling) by buffer thread.

And when you sync, with one write of memory_cct to disk, you make old
entries permanently invalid and new entries permanently valid.

If the machine crashes before sync (and some of the directory sectors were
written and some not), the new entries will always be considered invalid
and the old entries always valid, because a new crash count will be used
and the crash count table at the old crash count index will never be
modified.
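
(One way to read this as code --- an interpretation of the description,
not SpadFS's actual source. It assumes memory_cct[cc] was already
incremented to some value n when the transaction opened and is written to
disk only at sync, so the top bit of txc selects whether an entry is valid
before or from that count.)

/* hypothetical decode of an entry's (txc,cc) pair against the on-disk
   crash count table, per the description above */
static int entry_valid(unsigned int txc, unsigned short cc,
                       const unsigned int *disk_cct)
{
        unsigned int n = txc & 0x7fffffff; /* count the entry was tagged with */

        if (txc & 0x80000000)   /* new entry: valid once its sync hit disk */
                return disk_cct[cc] >= n;
        else                    /* old entry: invalid once that sync hit disk */
                return disk_cct[cc] < n;
}
/* before sync: disk_cct[cc] == n-1, so old entries win; after sync: == n,
   so new entries win; after a crash, disk_cct[cc] stays n-1 forever */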

> To commit the first rename, n is written into cct[entry->cc]. To
> commit both, n+1 is written instead. Committing the second
> transaction without committing the first is not possible.
>
> Now clearly we are disagreeing, so I must have misunderstood your
> design somehow. Can you see how?
>
> Jörn

Mikulas

Mikulas Patocka

Nov 3, 2006, 9:40:02 AM
to Nikita Danilov

On Fri, 3 Nov 2006, Nikita Danilov wrote:

> Mikulas Patocka writes:
> > > Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz> writes:
> > >
> > >> new method to keep data consistent in case of crashes (instead
> > >> of journaling),
> > >
> > > What is that method?
> >
> > Some tricks to avoid journal --- see
> > http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/download/INTERNALS
> >
> > --- unlike journaling it survives only 65536 crashes :)
>

> What happens when hard-linked file is accessed, and it is found that
> last fnode (one in fixed_fnode_block), has wrong "crash count"?
>
> Nikita.

The fixed fnode block contains a (txc,cc) pair describing which fnode and
nlink count is valid. --- Currently two fnodes are superfluous (there
could be just one); they are reserved for the possibility of atomically
modifying extended attributes --- but there is no code currently that does
it.

The fnodes live on their own with their own (txc,cc) pair --- it is a bit
confusing to have pair on both fixed_fnode_block and fnode --- the reason
is that the code for handling fnodes in directories can be reused to
handle fnodes in fixed_fnode_blocks and I can avoid many
if (is_fnode_fixed()) branches.

If the fnode in the fixed_fnode_block has an invalid crash count, the
fixed_fnode_block's (cc,txc) pair should never point to it. Or did that
happen to you?

Mikulas

Jörn Engel

Nov 3, 2006, 9:53:50 AM
to Mikulas Patocka
On Fri, 3 November 2006 15:19:09 +0100, Mikulas Patocka wrote:
>
> >>Really it can batch any number of modifications into one transaction
> >>(unless fsync or sync is called). Transaction is closed only on
> >>fsync/sync, if 2 minutes pass (can be adjusted) or when the disk runs out
> >>of space.
> >
> >Interesting. Let me pick an example and see where we're going from
> >there. You have four directories, A, B, C and D, none of which is the
> >parent of another. Two cross-directory renames happen:
> >$ mv A/foo B/
> >$ mv C/bar D/
> >
> >This will cause four modifications, one to each of the directories. I
> >would have assumed that the modifications to A and B receive one
> >transaction number n, C and D get a different one, n+1 if nothing else
> >is going on in between.
>
> They most likely receive the same transaction (this is not like journaling
> transaction --- new transactions are issued only on conditions above).

That means one transaction in your terminology contains many
transactions in my terminology, which explains a bit of confusion.

> They may be written in any order (that's some improvement over
> journaling) by buffer thread.
>
> And when you sync, with one write of memory_cct to disk, you make old
> entries permanently invalid and new entries permanently valid.

Ok, I seem to understand it now. Quite interesting.

> If the machine crashes before sync (and some of directory sectors were
> written and some not), new entries will always be considered invalid, and
> old entries always valid, because new crash count will be used and crash
> count table at old crash count index will never be modified.

So the only overflow you have to fear is the 16-bit cc overflow. Once
that hits, your filesystem is end-of-life and cannot be written to
anymore.

Has it ever occurred to you how similar your approach is to the
venerable sprite lfs? Lfs syncs by writing a "checkpoint", which
contains quite a bit of information. You sync by just writing a
single number. But in the end, both designs have a lot of
non-committed data already written to disk which only becomes valid
once the final (checkpoint|transaction count) is written.

And considering that writing several kB of contiguous data to disk is
nearly free, compared to the initial seek, both commit operations
should take about the same time.

So which, if I may ask, are the advantages of your design over sprite
lfs?

http://citeseer.ist.psu.edu/rosenblum91design.html

Jörn

--
ticks = jiffies;
while (ticks == jiffies);
ticks = jiffies;
-- /usr/src/linux/init/main.c

Oleg Verych

Nov 3, 2006, 12:08:57 PM
to Andrew Morton
In gmane.linux.kernel, you wrote:
[]

> From: Andrew Morton <ak...@osdl.org>
>
> As Mikulas points out, (1 << anything) won't be evaluating to zero.

How about integer overflow ?
____

Oleg Verych

Nov 3, 2006, 12:31:22 PM
to Mikulas Patocka
On Fri, Nov 03, 2006 at 06:09:39PM +0100, Mikulas Patocka wrote:
> >In gmane.linux.kernel, you wrote:
> >[]
> >>From: Andrew Morton <ak...@osdl.org>
> >>
> >>As Mikulas points out, (1 << anything) won't be evaluating to zero.
> >
> >How about integer overflow ?
>
> C standard defines that shifts by more bits than size of a type are
> undefined (in fact 1<<32 produces 1 on i386, because processor uses only 5
> bits of a count).
,--
|#include <stdio.h>
|int main(void) {
| unsigned int b = 1;
|
| printf("%u\n", (1 << 33));
| printf("%u\n", (b << 33));
| return 0;
|}
|$ gcc bit.c && ./a.out
`--

There *is* a difference, isn't there?

> Mikulas

Mikulas Patocka

Nov 3, 2006, 1:15:49 PM
to Oleg Verych

On Fri, 3 Nov 2006, Oleg Verych wrote:

> On Fri, Nov 03, 2006 at 06:09:39PM +0100, Mikulas Patocka wrote:
>>> In gmane.linux.kernel, you wrote:
>>> []
>>>> From: Andrew Morton <ak...@osdl.org>
>>>>
>>>> As Mikulas points out, (1 << anything) won't be evaluating to zero.
>>>
>>> How about integer overflow ?
>>
>> C standard defines that shifts by more bits than size of a type are
>> undefined (in fact 1<<32 produces 1 on i386, because processor uses only 5
>> bits of a count).
> ,--
> |#include <stdio.h>
> |int main(void) {
> | unsigned int b = 1;
> |
> | printf("%u\n", (1 << 33));
> | printf("%u\n", (b << 33));
> | return 0;
> |}
> |$ gcc bit.c && ./a.out
> `--
>
> There *is* difference, isn't it?

The standard says that the result is undefined, so the compiler is
standard-compliant. It could have returned any number and still be
correct.

Mikulas

Mikulas Patocka

Nov 3, 2006, 1:48:46 PM
to Jan Engelhardt
>>>> So anyway, why do you need _llseek? Can't you just use lseek() like
>>>> everyone else?
>>>
>>> Because I want it to work with glibc 2.0 that I still use on one machine.
>>
>> BTW. is it some interaction with symbols defined elsewhere or were _syscall
>> macros dropped altogether? Which glibc symbol should I use in #ifdef to tell if
>> glibc has 64-bit support?
>
> -D_LARGEFILE_SOURCE=1 -D_LARGE_FILES -D_FILE_OFFSET_BITS=64
>
> I think the second is not needed.

I see, but the question is: how do I test in the C preprocessor that glibc
is sufficiently new to react to them?

Now I changed it to:
#ifdef __linux__
#if !defined(__GLIBC__) || __GLIBC__ < 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ < 1)
#include <asm/unistd.h>
#ifdef __NR__llseek
#define use_llseek
static _syscall5(int, _llseek, uint, fd, ulong, hi, ulong, lo, loff_t *,
res, uint, wh);
#endif
#endif
#endif /* __linux__ */

So we'll see if someone else runs into the problem.

dean gaudet

Nov 3, 2006, 2:01:39 PM
to Mikulas Patocka

it seems to me that you only need to be able to represent a range of the
most recent 65536 crashes... and could have an online process which goes
about "refreshing" old objects to move them forward to the most recent
crash state. as long as you know the minimum on-disk crash count you can
use it as an offset.

-dean

p.s. i could see a network device with spotty connectivity causing a large
bump in crash count in a short period of time.

Mikulas Patocka

Nov 3, 2006, 2:02:27 PM
to Jörn Engel

It is very different from LFS. LFS is a log filesystem, i.e. the journal
spans the whole device. The problem with this design is that it's fast for
writes (cool benchmark numbers) and slow in real-world workloads.

LFS places files according to the time they were created, not according to
their directory.

If you have a directory with some project where you have files that you
edited today, a day ago, a week ago, a month ago etc., then any current
filesystem (even ext2) will try to place the files near each other ---
while LFS will scatter them over the whole partition according to the time
they were written. --- I believe this is the reason why log-structured
filesystems are not in wide use --- this is a case where optimizing for a
benchmark kills real-world performance.

Mikulas

Adrian Bunk

Nov 3, 2006, 2:08:44 PM
to Oleg Verych
On Fri, Nov 03, 2006 at 06:36:09PM +0100, Oleg Verych wrote:
> On Fri, Nov 03, 2006 at 06:09:39PM +0100, Mikulas Patocka wrote:
> > >In gmane.linux.kernel, you wrote:
> > >[]
> > >>From: Andrew Morton <ak...@osdl.org>
> > >>
> > >>As Mikulas points out, (1 << anything) won't be evaluating to zero.
> > >
> > >How about integer overflow ?
> >
> > C standard defines that shifts by more bits than size of a type are
> > undefined (in fact 1<<32 produces 1 on i386, because processor uses only 5
> > bits of a count).
> ,--
> |#include <stdio.h>
> |int main(void) {
> | unsigned int b = 1;
> |
> | printf("%u\n", (1 << 33));
> | printf("%u\n", (b << 33));
> | return 0;
> |}
> |$ gcc bit.c && ./a.out
> `--
>
> There *is* difference, isn't it?

It's undefined, and the results with your example depend on the gcc
version and optimization level.

E.g. with gcc 4.1, there is *no* difference any more if you turn on
optimization.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

Oleg Verych

Nov 3, 2006, 2:28:03 PM
to Adrian Bunk
On Fri, Nov 03, 2006 at 08:08:24PM +0100, Adrian Bunk wrote:
> On Fri, Nov 03, 2006 at 06:36:09PM +0100, Oleg Verych wrote:
> > On Fri, Nov 03, 2006 at 06:09:39PM +0100, Mikulas Patocka wrote:
> > > >In gmane.linux.kernel, you wrote:
> > > >[]
> > > >>From: Andrew Morton <ak...@osdl.org>
> > > >>
[^0] > > > >>As Mikulas points out, (1 << anything) won't be evaluating to zero.

> > > >
> > > >How about integer overflow ?
> > >
> > > C standard defines that shifts by more bits than size of a type are
> > > undefined (in fact 1<<32 produces 1 on i386, because processor uses only 5
> > > bits of a count).
> > ,--
> > |#include <stdio.h>
> > |int main(void) {
> > | unsigned int b = 1;
> > |
> > | printf("%u\n", (1 << 33));
> > | printf("%u\n", (b << 33));
> > | return 0;
> > |}
> > |$ gcc bit.c && ./a.out
> > `--
> >
> > There *is* difference, isn't it?
>
> It's undefined, and the results with your example depend on the gcc
> version and optimization level.
>
> E.g. with gcc 4.1, there is *no* difference any more if you turn on
> optimization.

Sure it is. And it is *zero*, not as stated in [^0].
,--
|olecom@flower:/tmp$ gcc --version
|gcc (GCC) 4.1.2 20060901 (prerelease) (Debian 4.1.1-13)
`--

Hmm. Did I spend more time on microcontroller C than on PC C? Seems like
yes. So pay it no mind, please.

> cu
> Adrian
>
____

Adrian Bunk

Nov 3, 2006, 2:52:10 PM
to Mikulas Patocka
On Fri, Nov 03, 2006 at 02:31:46PM +0100, Mikulas Patocka wrote:
> >>>I am fully aware the counters are effectively 48-bit. If they were
> >>>just 32-bit, you would likely have hit the problem yourself already.
> >>
> >>Given the seek time 0.01s, 31-bit value would last for minimum time of 248
> >>days when doing only syncs and nothing else. 47-bit value will last for
> >>reasonably long.
> >
> >So you can at most do one transaction per drive seek? That would
> >definitely solve the overflow case, but hardly sounds like a
> >high-performance filesystem. :)
>
> Really it can batch any number of modifications into one transaction
> (unless fsync or sync is called). Transaction is closed only on
> fsync/sync, if 2 minutes pass (can be adjusted) or when the disk runs out
> of space.

O_DIRECT ?
O_SYNC ?

> Mikulas

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed


Paul E. McKenney

Nov 3, 2006, 3:02:10 PM
to Linus Torvalds
On Thu, Nov 02, 2006 at 03:15:33PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 2 Nov 2006, Mikulas Patocka wrote:
> >
> > * There is a rw semaphore that is locked for read for nearly all operations
> > and locked for write only rarely. However locking for read causes cache line
> > pingpong on SMP systems. Do you have an idea how to make it better?

[ . . . ]

> (Seqlocks could be changed to drop the first requirement, although it
> could cause some serious starvation issues, so I'm not sure it's a good
> idea. For RCU the atomic nature is pretty much designed-in.)

I can't help putting in a plug for SRCU, which is in the 2.6.19-rc series,
and which allows readers to sleep. (http://lwn.net/Articles/202847/)

SRCU allows readers and writers to run concurrently (as do all forms of
RCU). If this is a problem, it might be worth looking into Gautham
Shenoy's reader-writer lock built on top of RCU. (A version for hotplug
may be found at http://lkml.org/lkml/2006/10/26/73.) This approach
keeps the reader-writer-lock semantics, but gets rid of cache thrashing.
That said, writers have to wait for a grace period.

And as Linus pointed out, if you have disk I/O involved, you probably
won't notice normal reader-writer-lock overhead.
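
(For reference, a minimal sketch of the 2.6.19 SRCU API shape, with
hypothetical function names; init_srcu_struct() would be called once,
e.g. at mount.)

#include <linux/srcu.h>

static struct srcu_struct fs_srcu;      /* init_srcu_struct(&fs_srcu) once */

static void fs_read_side(void)
{
        int idx = srcu_read_lock(&fs_srcu);
        /* ... readers here may sleep, e.g. wait for disk IO ... */
        srcu_read_unlock(&fs_srcu, idx);
}

static void fs_write_side(void)
{
        /* ... publish updated data structures ... */
        synchronize_srcu(&fs_srcu);     /* waits out pre-existing readers */
}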

Thanx, Paul

Christoph Lameter

Nov 3, 2006, 3:30:46 PM
to Mikulas Patocka
On Fri, 3 Nov 2006, Mikulas Patocka wrote:

> I have, I may find them and post them. (but the university wants me to post
> them to some conference, so I should keep them secret :-/)

Ph.D. dissertations are public. The problem is that if you have not
finished yet then you may run into trouble with the condition that the
dissertation must not have been published prior to examination by your
advisors.

Jan Engelhardt

Nov 3, 2006, 4:53:01 PM
to Mikulas Patocka
>> > > > So anyway, why do you need _llseek? Can't you just use lseek()
>> > > > like
>> > > > everyone else?
>> > >
>> > > Because I want it to work with glibc 2.0 that I still use on one
>> > > machine.
>> >
>> > BTW. is it some interaction with symbols defined elsewhere or were
>> > _syscall
>> > macros dropped altogether? Which glibc symbol should I use in #ifdef
>> > to tell if
>> > glibc has 64-bit support?
>>
>> -D_LARGEFILE_SOURCE=1 -D_LARGE_FILES -D_FILE_OFFSET_BITS=64
>>
>> I think the second is not needed.
>
> I see, but the question is how do I test in C preprocessor that glibc is
> sufficiently new to react on them?

__GLIBC__ and __GLIBC_MINOR__
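
For instance, a minimal sketch (the 2.1 cutoff is my assumption --- check
the glibc documentation for the exact version that honors these):

    #include <features.h>
    #if defined(__GLIBC__) && \
        (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 1))
    /* glibc is new enough to react to -D_FILE_OFFSET_BITS=64 */
    #endif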


-`J'
--

Oleg Verych

unread,
Nov 3, 2006, 5:01:54 PM11/3/06
to linux-...@vger.kernel.org
On 2006-11-03, Mikulas Patocka wrote:
[]
>> Well, at least for XFS everybody tell that it should be used with UPS only if
>> you really care about your data. I think it has something to do with heavy
>> in-RAM caching this filesystem does.
>
> The system is allowed to cache anything unless sync/fsync is called.
> Someone said that XFS has some bugs such that if it crashes the wrong way,
> it can lose already-synced data ... don't know. Plus it has that infamous
> feature (not a bug) that it commits the size increase but not the data and
> you see zero-filled files.

AFAIK, XFS by design must provide _file system_ consistency, not data consistency.

[]
>> PS. Do you have any benchmarks of your filesystem? Did you do any longer
>> automated tests to prove it is not going to lose data too easily?


>
> I have, I may find them and post them. (but the university wants me to
> post them to some conference, so I should keep them secret :-/)

Interesting to know how quick and stable "another one" FS is.

On my new hardware, with a dual-core 3.4G CPU, 1G of RAM, 1/2TB disk space
(office pro, yes *office* (pro)), `dd' can pull 50M/s, while (running and)
extracting 2.6.19-rc2 on a fresh 20Gb partition (xfs, jfs) yields nearly 4M/s.

To an ordinary user that seems very slow.

Ext2 (not ext3) is as fast as shmfs until RAM is full.
And after 13 cycles XFS had 11-12 directories with a good Linux source tree,
ext2 6-7 (IIRC). At the start of the 14th cycle I pushed the reset button, btw.

Mounting & repairing XFS took less than a minute; e2fsck spent more time
on printing than on repairing, I think (:.

Want to create another one? OK, but why not improve an existing one? >/dev/null
____

Mikulas Patocka

unread,
Nov 3, 2006, 5:43:02 PM11/3/06
to Oleg Verych
> []
>>> PS. Do you have any benchmarks of your filesystem? Did you do any longer
>>> automated tests to prove it is not going to lose data too easily?
>>
>> I have, I may find them and post them. (but the university wants me to
>> post them to some conference, so I should keep them secret :-/)
>
> Interesting to know how quick and stable "another one" FS is.
>
> On my new hardware, with a dual-core 3.4G CPU, 1G of RAM, 1/2TB disk space
> (office pro, yes *office* (pro)), `dd' can pull 50M/s, while (running and)
> extracting 2.6.19-rc2 on a fresh 20Gb partition (xfs, jfs) yields nearly 4M/s.
>
> To an ordinary user that seems very slow.

So try my filesystem and tell me if mine does better :)

> Ext2 (not ext3) is as fast as shmfs until RAM is full.
> And after 13 cycles XFS had 11-12 directories with a good Linux source tree,
> ext2 6-7 (IIRC). At the start of the 14th cycle I pushed the reset button, btw.

My spadfs will leave a snapshot there of the situation from at most 2
minutes ago (this sync interval can be changed with a mount option).

> Mounting & repairing XFS took less than a minute; e2fsck spent more time
> on printing than on repairing, I think (:.
>
> Want to create another one? OK, but why not improve an existing one? >/dev/null

Because redesign is sometimes better.

And you can't generally improve XFS and JFS by simplifying their
structures, because people would scream about incompatibility then.

Mikulas

Jörn Engel

unread,
Nov 4, 2006, 5:46:38 AM11/4/06
to Mikulas Patocka
On Fri, 3 November 2006 20:01:56 +0100, Mikulas Patocka wrote:
> >
> >So which, if I may ask, are the advantages of your design over sprite
> >lfs?
>
> It is very different from LFS. LFS is a log filesystem, i.e. the journal
> spans the whole device. The problem with this design is that it's fast for
> writes (cool benchmark numbers) and slow in real-world workloads.
>
> LFS places files according to the time they were created, not according to
> their directory.
>
> If you have a directory with some project where you have files that you
> edited today, a day ago, a week ago, a month ago etc., then any current
> filesystem (even ext2) will try to place the files near each other --- while
> LFS will scatter the files over the whole partition according to the time
> they were written. --- I believe that this is the reason why log-structured
> filesystems are not in wide use --- this is a case where optimizing for a
> benchmark kills real-world performance.

Darn, I was asking the wrong question again. Let me rephrase:

So which, if I may ask, are the advantages of your crash
count/transaction count design over the sprite lfs checkpoint design?

Allocation strategy is an interesting topic as well. Rosenblum and
Ousterhout were wrong in their base assumption that read performance
won't matter long-term, as caches are exponentially growing. It
turned out that storage size was growing just as fast and the ratio
remained roughly the same. But let us postpone that for a while.

Jörn

--
tglx1 thinks that joern should get a (TM) for "Thinking Is Hard"
-- Thomas Gleixner

Jörn Engel

unread,
Nov 4, 2006, 5:53:33 AM11/4/06
to dean gaudet
On Fri, 3 November 2006 11:00:58 -0800, dean gaudet wrote:
>
> it seems to me that you only need to be able to represent a range of the
> most recent 65536 crashes... and could have an online process which goes
> about "refreshing" old objects to move them forward to the most recent
> crash state. As long as you know the minimum on-disk crash count you can
> use it as an offset.

You really don't want to go down that path. Doubling the storage size
will double the work necessary to move old objects - hard to imagine a
design that scales worse.

CPU schedulers, btw, take this approach. But they cheat, as they know
the maximum lifetime of their objects (in-flight instructions, rename
registers, ...) is bounded by n. Old objects are refreshed for free.
http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html

Jörn

--
A defeated army first battles and then seeks victory.
-- Sun Tzu

dean gaudet

unread,
Nov 4, 2006, 6:14:44 AM11/4/06
to Jörn Engel
On Sat, 4 Nov 2006, Jörn Engel wrote:

> On Fri, 3 November 2006 11:00:58 -0800, dean gaudet wrote:
> >
> > it seems to me that you only need to be able to represent a range of the
> > most recent 65536 crashes... and could have an online process which goes
> > about "refreshing" old objects to move them forward to the most recent
> > crash state. As long as you know the minimum on-disk crash count you can
> > use it as an offset.
>
> You really don't want to go down that path. Doubling the storage size
> will double the work necessary to move old objects - hard to imagine a
> design that scales worse.

there's no doubling of storage size required.

-dean

Gautham R Shenoy

unread,
Nov 4, 2006, 12:37:32 PM11/4/06
to Mikulas Patocka
On Thu, Nov 02, 2006 at 10:52:47PM +0100, Mikulas Patocka wrote:
> Hi

Hi Mikulas


>
> As my PhD thesis, I am designing and writing a filesystem, and it's now in
> a state that it can be released. You can download it from
> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>

> It has some new features, such as keeping inode information directly in
> directory (until you create hardlink) so that ls -la doesn't seek much,

> new method to keep data consistent in case of crashes (instead of

> journaling), free space is organized in lists of free runs and converted
> to bitmap only in case of extreme fragmentation.
>
> It is not very widely tested, so if you want, test it.
>
> I have these questions:


>
> * There is a rw semaphore that is locked for read for nearly all
> operations and locked for write only rarely. However locking for read
> causes cache line pingpong on SMP systems. Do you have an idea how to make
> it better?
>

> It could be improved by making a semaphore for each CPU and locking for
> read only the CPU's semaphore and for write all semaphores. Or is there a
> better method?

I am currently experimenting with a light-weight reader-writer semaphore
with the objective of doing away with what you call the reader-side cache
line "ping pong". It achieves this by using a per-cpu refcount.

A drawback of this approach, as Eric Dumazet mentioned elsewhere in this
thread, would be that each instance of the rw_semaphore would require
(NR_CPUS * sizeof(int)) bytes worth of memory in order to keep track of
the per-cpu refcount, which can prove to be pretty costly if this
rw_semaphore is for something like inode->i_alloc_sem.
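
Roughly, the shape is this (a sketch only --- hypothetical names, with
preemption and memory-barrier handling omitted):

    /* Readers touch only their own CPU's counter; a writer must stop
     * new readers and wait for every per-cpu counter to drain. */
    struct lw_rwsem {
        int reader_count[NR_CPUS];  /* cacheline-padded in a real version */
        int writer_pending;
    };
    /*
     * down_read:  while (writer_pending) wait;  reader_count[this_cpu]++
     * up_read:    reader_count[this_cpu]--
     * down_write: writer_pending = 1;  wait until all reader_count[i] == 0
     * up_write:   writer_pending = 0
     */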

So the question I am interested in is, how many *live* instances of this
rw_semaphore are you expecting to have at any given time?
If this number is a constant (and/or not very big!), the light-weight
reader writer semaphore might be useful.

Regards
Gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Eric Dumazet

unread,
Nov 4, 2006, 1:28:17 PM11/4/06
to e...@in.ibm.com
Gautham R Shenoy a écrit :

We might use a hybrid approach: use a percpu counter if NR_CPUS <= 8

#define refcount_addr(zone, cpu) zone[cpu]

For larger setups, have a fixed limit of 8 counters, and use a modulo

#define refcount_addr(zone, cpu) zone[cpu & 7]

In order not to use too much memory, we could use a kind of vmalloc() space, using
one PAGE per cpu, so that addr(cpu) = base + (cpu)*PAGE_SIZE;
(vmalloc space allows a NUMA allocation if possible)

So instead of storing in an object a table of 8 pointers, we store only the
address for cpu0.
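
Concretely, something like this (a sketch with made-up names):

    #define RC_SLOTS 8  /* at most 8 counters; larger CPU numbers fold */

    struct shared_refcount {
        atomic_t slot[RC_SLOTS];
    };
    #define refcount_addr(zone, cpu) (&(zone)->slot[(cpu) & (RC_SLOTS - 1)])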


>
> So the question I am interested in is, how many *live* instances of this
> rw_semaphore are you expecting to have at any given time?
> If this number is a constant (and/or not very big!), the light-weight
> reader writer semaphore might be useful.
>
> Regards
> Gautham.

-

Mikulas Patocka

unread,
Nov 4, 2006, 1:40:50 PM11/4/06
to Eric Dumazet
> The problem with a per_cpu biglock is that you may consume a lot of RAM for
> big NR_CPUS. Count 32 KB per 'biglock' if NR_CPUS=1024

Does one Linux kernel run on a system with 1024 CPUs? I guess it must fry
spinlocks... (or even lock up due to spinlock livelocks)

Mikulas

Mikulas Patocka

unread,
Nov 4, 2006, 1:46:31 PM11/4/06
to Christoph Lameter
On Fri, 3 Nov 2006, Christoph Lameter wrote:

> On Fri, 3 Nov 2006, Mikulas Patocka wrote:
>
>> I have, I may find them and post them. (but the university wants me to post
>> them to some conference, so I should keep them secret :-/)
>
> Ph.D. dissertations are public. The problem is that if you have not
> finished yet then you may run into trouble with the condition that the
> dissertation must not have been published prior to examination by your
> advisors.

Yes, but many "paperwork" conferences have the requirement that the work
submitted must not have been published before. Does posting it to a mailing
list qualify as "publishing" as well?

Mikulas

Mikulas Patocka

unread,
Nov 4, 2006, 1:50:33 PM11/4/06
to Jörn Engel
>>> So which, if I may ask, are the advantages of your design over sprite
>>> lfs?
>>
>> It is very different from LFS. LFS is a log filesystem, i.e. the journal
>> spans the whole device. The problem with this design is that it's fast for
>> writes (cool benchmark numbers) and slow in real-world workloads.
>>
>> LFS places files according to the time they were created, not according to
>> their directory.
>>
>> If you have a directory with some project where you have files that you
>> edited today, a day ago, a week ago, a month ago etc., then any current
>> filesystem (even ext2) will try to place the files near each other --- while
>> LFS will scatter the files over the whole partition according to the time
>> they were written. --- I believe that this is the reason why log-structured
>> filesystems are not in wide use --- this is a case where optimizing for a
>> benchmark kills real-world performance.
>
> Darn, I was asking the wrong question again. Let me rephrase:
>
> So which, if I may ask, are the advantages of your crash
> count/transaction count design over the sprite lfs checkpoint design?
>
> Allocation strategy is an interesting topic as well. Rosenblum and
> Ousterhout were wrong in their base assumption that read performance
> won't matter long-term, as caches are exponentially growing. It
> turned out that storage size was growing just as fast and the ratio
> remained roughly the same. But let us postpone that for a while.
>
> Jörn

LFS fragments data by design ... it can't write to already allocated
space, so if you write to the middle of an LFS directory, it will allocate
a new block, new indirect pointers for that directory, a new block in the
inode table, etc.

The same fragmentation happens with files (although with files it could be
fixed by not relying on the consistency of their content).

Mikulas

Mikulas Patocka

unread,
Nov 4, 2006, 1:52:40 PM11/4/06
to dean gaudet
>> If it overflows, it increases crash count instead. So really you have 2^47
>> transactions or 65536 crashes and 2^31 transactions between each crash.
>
> it seems to me that you only need to be able to represent a range of the
> most recent 65536 crashes... and could have an online process which goes
> about "refreshing" old objects to move them forward to the most recent
> crash state. As long as you know the minimum on-disk crash count you can
> use it as an offset.

After 65536 crashes you have to run spadfsck --reset-crash-counts. Maybe
I'll add that functionality to the kernel driver too, so that it will be
formally correct.

Mikulas

> -dean
>
> P.S. I could see a network device with spotty connectivity causing a large
> bump in crash count in a short period of time.
>

Grzegorz Kulewski

unread,
Nov 4, 2006, 1:57:27 PM11/4/06
to Mikulas Patocka
On Sat, 4 Nov 2006, Mikulas Patocka wrote:
>> > If it overflows, it increases crash count instead. So really you have 2^47
>> > transactions or 65536 crashes and 2^31 transactions between each crash.
>>
>> it seems to me that you only need to be able to represent a range of the
>> most recent 65536 crashes... and could have an online process which goes
>> about "refreshing" old objects to move them forward to the most recent
>> crash state. As long as you know the minimum on-disk crash count you can
>> use it as an offset.
>
> After 65536 crashes you have to run spadfsck --reset-crash-counts. Maybe
> I'll add that functionality to the kernel driver too, so that it will be
> formally correct.

Is there any reason you can not make these fields 64 or even 128 bits in
size to increase these "limits" dramatically?


Thanks,

Grzegorz Kulewski

Mikulas Patocka

unread,
Nov 4, 2006, 2:18:46 PM11/4/06
to Grzegorz Kulewski

On Sat, 4 Nov 2006, Grzegorz Kulewski wrote:

> On Sat, 4 Nov 2006, Mikulas Patocka wrote:
>>> > If it overflows, it increases crash count instead. So really you have 2^47
>>> > transactions or 65536 crashes and 2^31 transactions between each crash.
>>>
>>> it seems to me that you only need to be able to represent a range of the
>>> most recent 65536 crashes... and could have an online process which goes
>>> about "refreshing" old objects to move them forward to the most recent
>>> crash state. As long as you know the minimum on-disk crash count you can
>>> use it as an offset.
>>
>> After 65536 crashes you have to run spadfsck --reset-crash-counts. Maybe
>> I'll add that functionality to the kernel driver too, so that it will be
>> formally correct.
>
> Is there any reason you can not make these fields 64 or even 128 bits in size
> to increase these "limits" dramatically?

Yes.

First --- you need a table of 65536 entries; a table of 4G entries would be
too large.
Second --- it would make structures larger and thus some operations (like
scanning a directory with find) slower.
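
For illustration, the scheme described in this thread amounts to something
like the following packing (field names are hypothetical, not the actual
spadfs on-disk format):

    /* 16 + 32 bits = 6 bytes per object.  The crash count indexes the
     * 65536-entry table; the transaction count (31 bits used) orders
     * transactions within one crash generation. */
    struct txc_sketch {
        __u16 crash_count;
        __u32 transaction;
    } __attribute__((packed));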

Mikulas

Albert Cahalan

unread,
Nov 4, 2006, 3:00:44 PM11/4/06
to kan...@polcom.net, mik...@artax.karlin.mff.cuni.cz, linux-...@vger.kernel.org

Grzegorz Kulewski writes:


> On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>> As my PhD thesis, I am designing and writing a filesystem,
>> and it's now in a state that it can be released. You can
>> download it from http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>

> "Disk that can atomically write one sector (512 bytes) so that
> the sector contains either old or new content in case of crash."

New drives will soon use 4096-byte sectors. This is a better
match for the normal (non-VAX!) page size and reduces overhead.

> Well, maybe I am completely wrong but as far as I understand no
> disk currently will meet such a requirement. Disks can have
> (after a halted write):
> - old data,
> - new data,
> - nothing (unreadable sector - result of an incomplete write and disk
> internal checksum failure for that sector; happens especially
> often if you have frequent power outages).
>
> And possibly some broken drives may also return you something that
> they think is good data but really is not (shouldn't happen since
> both disks and cables should be protected by checksums, but hey...
> you can never be absolutely sure especially on very big storages).
>
> So... isn't this making your filesystem a little flawed in design?

It's equal to typical modern designs like JFS, NTFS, XFS, and Reiserfs.

It's much worse than ZFS, which doesn't even trust the disk
not to silently scramble the data.

It's a tad worse than various forms of ext2/ext3/ufs/ufs2, where a
fixed layout helps recovery. (ext2 lacks the atomic updates, but
it doesn't trust a bad shutdown either -- fsck will always run)

BTW, a person with disk recovery experience told me that drives
will sometimes reorder the sectors. Sector 42 becomes sector 7732,
sector 880880 becomes sector 12345, etc. The very best filesystems
can handle this without data loss (for example, ZFS). Merely great
filesystems will at least recognize that the data has been trashed.

Jörn Engel

unread,
Nov 4, 2006, 3:07:55 PM11/4/06
to dean gaudet

Oh, I double my storage size every few years when I buy a new disk.
And I would hate it if my filesystem became slower each time.

Jörn

--
Joern's library part 13:
http://www.chip-architect.com/

Jan-Benedict Glaw

unread,
Nov 4, 2006, 4:01:36 PM11/4/06
to Albert Cahalan
On Sat, 2006-11-04 14:59:53 -0500, Albert Cahalan <acah...@gmail.com> wrote:
> Grzegorz Kulewski writes:
> > On Thu, 2 Nov 2006, Mikulas Patocka wrote:
> > > As my PhD thesis, I am designing and writing a filesystem,
> > > and it's now in a state that it can be released. You can
> > > download it from http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
> >
> > "Disk that can atomically write one sector (512 bytes) so that
> > the sector contains either old or new content in case of crash."
>
> New drives will soon use 4096-byte sectors. This is a better
> match for the normal (non-VAX!) page size and reduces overhead.

Well... VAXen use physical PTEs for 512-byte pages; Linux uses 8
consecutive hardware pages to simulate 4K pages.

On top of that, some of today's machines have configurable page sizes.
Besides that, 512 byte sectors are quite a clever thing: Drives
probably can write a number of consecutive sectors, so if you want to
send a page, just send eight sectors.

> BTW, a person with disk recovery experience told me that drives
> will sometimes reorder the sectors. Sector 42 becomes sector 7732,
> sector 880880 becomes sector 12345, etc. The very best filesystems
> can handle this without data loss (for example, ZFS). Merely great
> filesystems will at least recognize that the data has been trashed.

Uh? This should be transparent to the host computer, so logical sector
numbers won't change.

MfG, JBG

--
Jan-Benedict Glaw jbg...@lug-owl.de +49-172-7608481
Signature of: http://catb.org/~esr/faqs/smart-questions.html
the second :


Mikulas Patocka

unread,
Nov 4, 2006, 6:38:45 PM11/4/06
to Albert Cahalan
> Grzegorz Kulewski writes:
>> On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>>> As my PhD thesis, I am designing and writing a filesystem,
>>> and it's now in a state that it can be released. You can
>>> download it from http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>>
>> "Disk that can atomically write one sector (512 bytes) so that
>> the sector contains either old or new content in case of crash."
>
> New drives will soon use 4096-byte sectors. This is a better
> match for the normal (non-VAX!) page size and reduces overhead.

The drive (IDE model; SCSI can have a larger sector size) will do
read-modify-write for smaller writes. So there should be no compatibility
issues. (this possibility is in the new ATA standard and there is a way to
detect the physical sector size)

But how will fdisk deal with it? Fdisk by default aligns partitions on a
63-sector boundary, so it will make all sectors misaligned and seriously
kill performance even if the filesystem uses proper 8-sector-aligned
accesses. (With 4K physical sectors, a partition starting at LBA 63 means
every 8-sector filesystem block straddles two physical sectors, so each
write becomes a read-modify-write.)

Mikulas

Kyle Moffett

unread,
Nov 4, 2006, 6:49:12 PM11/4/06
to Mikulas Patocka
On Nov 04, 2006, at 18:38:11, Mikulas Patocka wrote:
> But how will fdisk deal with it? Fdisk by default aligns partitions on a
> 63-sector boundary, so it will make all sectors misaligned and seriously
> kill performance even if the filesystem uses proper 8-sector-aligned
> accesses.

Don't use a partition-table format that dates back to drives with
actual reported physical geometry and which also maxed out at 2MB or
so? Even the mac-format partition tables (which aren't that much
newer) don't care about physical drive geometry.

Besides, unless you're running DOS, Windows 95, or some random
ancient firmware that looks at your partition tables or whatever you
can just tell fdisk to ignore the 63-sector-alignment constraint and
align your partitions more efficiently anyways. But if you're
dealing with hardware so new it supports 4k or 8k sectors, you really
should be using EFI or something.

Cheers,
Kyle Moffett

Linus Torvalds

unread,
Nov 4, 2006, 7:54:59 PM11/4/06
to Mikulas Patocka

On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>
> As my PhD thesis, I am designing and writing a filesystem, and it's now in a
> state that it can be released. You can download it from
> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/

Ok, not having actually tested any of this, I only have a few comments on
the source code:

- the source tree layout is very confusing. Can you please separate the
mkfs/fsck parts more clearly from the kernel driver parts?

- you have a _very_ confusing usage of upper-case. Not only are a lot of
functions upper-case, some filenames are also upper-case. What would
otherwise be more readable just ends up being hard to read because it's
so odd and unexpected.

I'm sure there is some logic to it, but it escapes me.

- your whitespace usage needs some work: please put empty lines between
the declarations and the code in a function, and since you use a fair
amount of "goto"s, please do NOT indent them into the code (it's almost
impossible to pick out the target labels because you hide them with the
code).

- your whitespace, part 2: you have a fair number of one-liner
if-statements, where again there is no indentation, and thus the flow
is almost impossible to see. Don't write

if (somecomplexconditional) return;

but please instead write

if (somecomplexconditional)
return;

and perhaps use a few more empty lines to separate out the "paragraphs"
of code (the same way you write email - nobody wants to see one solid
block of code, you'd prefer to see "logical sections").

Here's a prime example of what NOT to do:

if (__likely(!(((*c)[1] - 1) & (*c)[1]))) (*c)[0] = key;

I dare anybody to be able to read that. That wasn't even the worst one:
some of those if-statements were so long that you couldn't even _see_
what the statement inside the if-statement even was (and I don't use an
80-column-wide terminal, this was in a 112-column xterm)

- why use "__d_off" etc hard-to-read types? You seem to have typedef'ed
it from sector_t, but you use a harder-to-read name than the original
type was. Hmm?

- you have a few comments, but you could have a lot more explanation,
especially since not all of your names are all that self-explanatory.

Ok, with that out of the way, let's say what I _like_ about it:

- it's fairly small

- the code, while having the above problems, looks generally fairly
clean. The whitespace issues get partially cleared by just running
"Lindent" on it, although that's not perfect either (it still indents
the goto target labels too much, although it at least makes them
_visible_. But it won't add empty lines to delineate sections, of
course, and it doesn't add comments ;^)

- I like a lot of the notions, and damn, small and simple are both
virtues on their own.

So if you could make the code easier to read, and were to do some
benchmarking to show what it's good at and what the problems are, I think
you'd find people looking at it. It doesn't look horrible to me.

Linus

Alan Cox

unread,
Nov 4, 2006, 8:54:19 PM11/4/06
to Albert Cahalan
Ar Sad, 2006-11-04 am 14:59 -0500, ysgrifennodd Albert Cahalan:
> New drives will soon use 4096-byte sectors. This is a better
> match for the normal (non-VAX!) page size and reduces overhead.

The innards of the disk are already a file system in themselves. The
disk is a network storage appliance on a funny cable with an odd command
set, nothing else. The internal storage model of the drive and external
one can be quite different.

> > - old data,
> > - new data,
> > - nothing (unreadable sector - result of an incomplete write and disk
> > internal checksum failure for that sector; happens especially
> > often if you have frequent power outages).

Also, theoretically, states like "a random block you read in the middle
of it being moved".

> > And possibly some broken drives may also return you something that
> > they think is good data but really is not (shouldn't happen since
> > both disks and cables should be protected by checksums, but hey...
> > you can never be absolutely sure especially on very big storages).

It happens because
- There is limited if any protection on the PCI bus generally
- Many PC systems don't have ECC memory, ECC cache
- PATA does not CRC protect the command block so if you do enough PATA
I/O (eg you are a US national lab ..) you *will* eventually get a bit
flip that gives you the wrong sector with no error data. SATA fixes that
one.
- Murphy is out to get you..

Network people use end-to-end checksums; when working with huge data
sets people generally use app<->app checksums. Serious network-using
people with big data sets also often turn off the hardware checksum
support on network cards - it's faster but riskier.

> BTW, a person with disk recovery experience told me that drives
> will sometimes reorder the sectors. Sector 42 becomes sector 7732,
> sector 880880 becomes sector 12345, etc. The very best filesystems

Not seen that, although they do move stuff around in their internal
block management of bad blocks. I've also seen hardware errors that lead
to data being messed up silently.

Alan

Mikulas Patocka

unread,
Nov 4, 2006, 11:14:51 PM11/4/06
to Linus Torvalds
On Sat, 4 Nov 2006, Linus Torvalds wrote:

> On Thu, 2 Nov 2006, Mikulas Patocka wrote:
>>
>> As my PhD thesis, I am designing and writing a filesystem, and it's now in a
>> state that it can be released. You can download it from
>> http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
>
> Ok, not having actually tested any of this, I only have a few comments on
> the source code:
>
> - the source tree layout is very confusing. Can you please separate the
> mkfs/fsck parts more clearly from the kernel driver parts?

Yes, fsck is already separated, mkfs could be too.

> - you have a _very_ confusing usage of upper-case. Not only are a lot of
> functions upper-case, some filenames are also upper-case. What would
> otherwise be more readable just ends up being hard to read because it's
> so odd and unexpected.
>
> I'm sure there is some logic to it, but it escapes me.

I'm used to this. I usually make important functions with uppercase
letters and unimportant temporary functions with lowercase letters.

But I see that it contradicts general kernel coding style, so I can change
it.

BTW do you find uppercase typedefs like
typedef struct {
...
} SPADFNODE;
confusing too?

Uppercase filenames are there because the files are taken from another
(not yet released) project. But the kernel driver does not share any code
except definitions of disk structures; I saw how badly an attempt to share
kernel code affected XFS.

> - your whitespace usage needs some work: please put empty lines between
> the declarations and the code in a function, and since you use a fair
> amount of "goto"s, please do NOT indent them into the code (it's almost
> impossible to pick out the target labels because you hide them with the
> code).
>
> - your whitespace, part 2: you have a fair number of one-liner
> if-statements, where again there is no indentation, and thus the flow
> is almost impossible to see. Don't write
>
> if (somecomplexconditional) return;
>
> but please instead write
>
> if (somecomplexconditional)
> return;
>
> and perhaps use a few more empty lines to separate out the "paragraphs"
> of code (the same way you write email - nobody wants to see one solid
> block of code, you'd prefer to see "logical sections").
>
> Here's a prime example of what NOT to do:
>
> if (__likely(!(((*c)[1] - 1) & (*c)[1]))) (*c)[0] = key;
>
> I dare anybody to be able to read that. That wasn't even the worst one:
> some of those if-statements were so long that you couldn't even _see_
> what the statement inside the if-statement even was (and I don't use an
> 80-column-wide terminal, this was in a 112-column xterm)

I see, that is fixable easily.

> - why use "__d_off" etc hard-to-read types? You seem to have typedef'ed
> it from sector_t, but you use a harder-to-read name than the original
> type was. Hmm?

I am used to __d_off from elsewhere. The same reason why I use
__likely/__unlikely instead of likely/unlikely.

__d_off may have a little meaning --- if someone wants to run a 32-bit
spadfs filesystem on a kernel configuration with 64-bit sector_t. But I'm
not sure if anyone would ever want that.

> - you have a few comments, but you could have a lot more explanation,
> especially since not all of your names are all that self-explanatory.
>
> Ok, with that out of the way, let's say what I _like_ about it:
>
> - it's fairly small
>
> - the code, while having the above problems, looks generally fairly
> clean. The whitespace issues get partially cleared by just running
> "Lindent" on it, although that's not perfect either (it still indents
> the goto target labels too much, although it at least makes them
> _visible_. But it won't add empty lines to delineate sections, of
> course, and it doesn't add comments ;^)
>
> - I like a lot of the notions, and damn, small and simple are both
> virtues on their own.
>
> So if you could make the code easier to read, and were to do some
> benchmarking to show what it's good at and what the problems are, I think
> you'd find people looking at it. It doesn't look horrible to me.

I placed some benchmarks on
http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/benchmarks/

The main shortcoming: slow fsync. fsync on spadfs generally has to flush
all metadata buffers (it could be improved at least for the case when the
file size does not change --- for databases).

Mikulas

Willy Tarreau

unread,
Nov 5, 2006, 3:36:28 AM11/5/06
to Mikulas Patocka
On Sun, Nov 05, 2006 at 05:14:06AM +0100, Mikulas Patocka wrote:
> On Sat, 4 Nov 2006, Linus Torvalds wrote:
> >- you have a _very_ confusing usage of upper-case. Not only are a lot of
> > functions upper-case, some filenames are also upper-case. What would
> > otherwise be more readable just ends up being hard to read because it's
> > so odd and unexpected.
> >
> > I'm sure there is some logic to it, but it escapes me.
>
> I'm used to this. I usually make important functions with uppercase
> letters and unimportant temporary functions with lowercase letters.
>
> But I see that it contradicts general kernel coding style, so I can change
> it.

We're more used to having uppercase for macros (or enums) and lowercase for
the rest. That way, when you read NULL, KERN_WARNING, PAGE_CACHE_SIZE,
INIT_LIST_HEAD..., you know that you're dealing with a macro, which is
very convenient.

> BTW do you find uppercase typedefs like
> typedef struct {
> ...
> } SPADFNODE;
> confusing too?

Yes for the reason above. Also, we don't much use type definitions for
structures, because it's easier to understand "struct spadfnode *node"
in a function declaration than "SPADFNODE *node". Take a look at other
file systems. You'll see lots of "struct inode", "struct buffer_head".
Sometimes, you'll find some types suffixed with "_t", such as "handle_t"
or "spinlock_t". Generally, they are used for very commonly used data
(such as spinlocks), or opaque data which are passed between functions
without any interpretation. But that's more an observation than a rule.

> Uppercase filenames are there because the files are taken from another
> (not yet released) project. But the kernel driver does not share any code
> except definitions of disk structures; I saw how badly an attempt to share
> kernel code affected XFS.

It's better to avoid uppercase in file names. I recently had to help
a user who could not build his kernel because of strange errors. I
finally found out that he was building from "entry.s" instead of "entry.S",
which suggested he had copied the tree onto FAT. He confirmed having used
a vfat-formatted USB stick to copy his tree. Such errors are very annoying
to debug.

(...)

Not bad at all, it seems!

Regards,
Willy

James Courtier-Dutton

unread,
Nov 5, 2006, 6:17:04 AM11/5/06
to Alan Cox
Alan Cox wrote:
<snip>

>
> Not seen that, although they do move stuff around in their internal
> block management of bad blocks. I've also seen hardware errors that lead
> to data being messed up silently.
>
> Alan
>

I have seen this too. I think that when an IDE drive relocates a sector
due to hard errors, one silently loses the information that was
stored in that sector.
How can one detect this? Of course it would be nice if the IDE drive
told us that sector X had just gone bad but I don't think they do. They
just silently relocate it because in some cases the sector has only gone
a "bit" bad, so the IDE drive relocates it before it totally fails.

I suppose a workaround is to provide an fs-level error check. This could
take the form of the fs adding a checksum to any file. To avoid
re-checksumming the entire file each time it changes, maybe break the file
up into blocks and checksum those. This would slow things down due to CPU
use for the checksum, but at least it could tell us as soon as a file
became corrupted, as the verification could be done on reading the file.

Another possible solution could be using a few bytes from each sector to
place an fs-level checksum in. Then, if the IDE drive silently relocates
the sector, the fs-level checksum will fail. I saw a feature like this
on some old filesystem, but I don't remember which. It placed a
checksum, a forwards chain link, and possibly a backwards chain link. So, if
the filesystem became really badly corrupted, one could pick any sector
on the disk and recover the entire file associated with it.
I seem to remember that OS/2 used a 32-bit forwards chain, but not the
checksum.
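
A minimal sketch of the per-block variant (helper and type names are
hypothetical; crc32() as provided by zlib):

    #include <stddef.h>
    #include <stdint.h>
    #include <zlib.h>   /* crc32() */

    struct blk_csum { uint32_t crc; };  /* stored alongside each block */

    /* Returns nonzero if the block still matches its stored checksum. */
    int verify_block(const void *data, size_t len, const struct blk_csum *cs)
    {
        return crc32(0L, data, (unsigned int)len) == cs->crc;
    }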

James

Brad Campbell

unread,
Nov 5, 2006, 6:29:04 AM11/5/06
to James Courtier-Dutton
James Courtier-Dutton wrote:

> I have seen this too. I think that when an IDE drive relocates a sector
> due to hard errors, one silently loses the information that was
> stored in that sector.
> How can one detect this? Of course it would be nice if the IDE drive
> told us that sector X had just gone bad but I don't think they do. They
> just silently relocate it because in some cases the sector has only gone
> a "bit" bad, so the IDE drive relocates it before it totally fails.

I've never seen this behaviour in a drive. All the drives I've seen mark bad sectors as "pending
reallocation", they they return read errors on that sector unless they manage to jag a good read, in
which case they then reallocate the sector. Or else they wait for you to write to the sector
triggering a reallocation.

There may be drives less well behaved out there, but I've not come across them.

Brad
--
"Human beings, who are almost unique in having the ability
to learn from the experience of others, are also remarkable
for their apparent disinclination to do so." -- Douglas Adams

Jan Engelhardt

unread,
Nov 5, 2006, 6:34:40 AM11/5/06
to Willy Tarreau

>> BTW do you find uppercase typedefs like
>> typedef struct {
>> ...
>> } SPADFNODE;
>> confusing too?
>
>Yes for the reason above. Also, we don't much use type definitions for
>structures, because it's easier to understand "struct spadfnode *node"
>in a function declaration than "SPADFNODE *node".

It gets worse when code authors begin to use

typedef struct { ... } MYSTRUCT, *PMYSTRUCT, **PPMYSTRUCT;

Most certainly you will run into "passing argument from incompatible
pointer"[1] and "request for member ■a■ in something not a structure or
union"[2] and "invalid type argument of ■->■"[3] (BTW I hate gcc using
Unicode chars in its output which are not displayed in the console):

typedef struct foo {
	int bar;
} ST, *PST;

void foobar2(PST a);

void foobar(ST a) {
	a->bar = 1;	// [3]: '->' on a non-pointer
	foobar2(a);	// [1]: struct passed where pointer expected
}
void foobar2(PST a) {
	a.bar = 1;	// [2]: '.' on a pointer
}

So I'd much rather see all the 'funky stars' (struct foo *) in the
parameter list, instead of trying to keep track of how many of them a
PST carries.

-`J'
--

Theodore Tso

unread,
Nov 5, 2006, 7:02:56 AM11/5/06
to Mikulas Patocka
On Sat, Nov 04, 2006 at 07:46:05PM +0100, Mikulas Patocka wrote:
> On Fri, 3 Nov 2006, Christoph Lameter wrote:
>
> >On Fri, 3 Nov 2006, Mikulas Patocka wrote:
> >
> >>I have, I may find them and post them. (but the university wants me to
> >>post
> >>them to some conference, so I should keep them secret :-/)
> >
> >Ph.D. dissertations are public. The problem is that if you have not
> >finished yet then you may run into trouble with the condition that the
> >dissertation must not have been published prior to examination by your
> >advisors.
>
> Yes, but many "paperwork" conferences have the requirement that the work
> submitted must not have been published before. Does posting it to a mailing
> list qualify as "publishing" as well?

The requirement by the vast majority of conferences is that it must
not have been published in a peer-reviewed journal or conference, and
you must not simultaneously have the paper under consideration by more
than one peer-reviewed journal or conference. In general, though, you
can publish work on a web page, mailing list, or even as an
un-reviewed Technical Report published by your department, without
harming your chances of submission to a peer-reviewed
conference/journal. Check with the program committee of the
conference to be sure, but that's the way it generally works.

- Ted

Bruno Cesar Ribas

unread,
Nov 5, 2006, 9:49:43 AM11/5/06
to Mikulas Patocka
On Sun, Nov 05, 2006 at 05:14:06AM +0100, Mikulas Patocka wrote:

Why don't you test with Bonnie++? I think we would get interesting results
too :)
Something like:
bonnie -u $USER -s 2048 -n 40:100k -d /path/to/mounted/test/

And I think you got interesting results there! If 'we' have it working on
SMP it would be even more interesting :)

Bruno

>
> The main shortcoming: slow fsync. fsync on spadfs generally has to flush
> all metadata buffers (it could be improved at least for the case when the
> file size does not change --- for databases).
>
> Mikulas
>
> > Linus
> >
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majo...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Bruno Ribas - ri...@c3sl.ufpr.br
http://web.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br

Albert Cahalan

unread,
Nov 5, 2006, 11:23:07 AM11/5/06
to James Courtier-Dutton
On 11/5/06, James Courtier-Dutton <Ja...@superbug.co.uk> wrote:

> I have seen this too. I think that when an IDE drive relocates a sector
> due to hard errors, one silently loses the information that was
> stored in that sector.

I didn't just mean hidden relocation of bad blocks.

I really meant that sectors can trade places. This is probably
what happens when the bad-block remapping is itself corrupt.

Inodes 16,17,18,19 trade places with inodes 40,41,42,43.
An indirect block for your database trades places with an
indirect block for an outgoing email. A chunk of /etc/shadow
trades places with a chunk of a user's ~/.bash_logout file.

> I suppose a workaround is to provide an fs-level error check. This could
> take the form of the fs adding a checksum to any file. To avoid
> re-checksumming the entire file each time it changes, maybe break the file
> up into blocks and checksum those. This would slow things down due to CPU
> use for the checksum, but at least it could tell us as soon as a file
> became corrupted, as the verification could be done on reading the file.

Yes indeed. This is what ZFS does. You can choose between
regular and crypto checksums. You can cause the filesystem
to replicate blocks over multiple devices. Unlike RAID, this lets
you recover from silent corruption.

> Another possible solution could be using a few bytes from each sector to
> place an fs-level checksum in. Then, if the IDE drive silently relocates
> the sector, the fs-level checksum will fail. I saw a feature like this
> on some old filesystem, but I don't remember which. It placed a
> checksum, a forwards chain link, and possibly a backwards chain link. So, if
> the filesystem became really badly corrupted, one could pick any sector
> on the disk and recover the entire file associated with it.

Both OS/400 and AmigaOS had this feature. AmigaOS used regular
512-byte sectors, making the data portion oddly sized. OS/400 made
the sectors bigger, by another 8 or 16 bytes if I remember right.

Albert Cahalan

unread,
Nov 5, 2006, 11:38:44 AM11/5/06
to Albert Cahalan, kan...@polcom.net, mik...@artax.karlin.mff.cuni.cz, linux-...@vger.kernel.org

"should be" does not imply "won't" :-)

On a drive which is capable of remapping sectors, imagine what
happens if the remapping data itself is corrupted. (the user data
is perfectly fine and is not being relocated)

What I mean is that the logical sector numbers not only change,
but they are the only thing changing. The user data never moves
to a different physical location, and is never intended to move.
The user data is perfectly readable. It just appears in the wrong
place as viewed by the OS.

Mikulas Patocka

unread,
Nov 5, 2006, 12:18:39 PM11/5/06
to Alan Cox
>>> And possibly some broken drives may also return you something that
>>> they think is good data but really is not (shouldn't happen since
>>> both disks and cables should be protected by checksums, but hey...
>>> you can never be absolutely sure especially on very big storages).
>
> It happens because
> - There is limited if any protection on the PCI bus generally
> - Many PC systems don't have ECC memory, ECC cache
> - PATA does not CRC protect the command block so if you do enough PATA
> I/O (eg you are a US national lab ..) you *will* eventually get a bit
> flip that gives you the wrong sector with no error data. SATA fixes that
> one.
> - Murphy is out to get you..

Should the IDE driver read back parameters after writing them, before
issuing the command? That should fix this problem. (except when the
command itself is written badly)

> Not seen that, although they do move stuff aorund in their internal
> block management of bad blocks. I've also seen hardware errors that lead
> to data being messed up silently.

I have seen one WD drive bought in 2003 with an error in its firmware's
cache-coherency code --- if you read and write 256 sectors to the same
places with some pattern repeatedly (with direct IO), it will discard a
write. It happens only with 256-sector writes; maybe some part of the
firmware treats 256 as 0. Maybe I'll create a testcase sometime.
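
A sketch of what such a testcase might look like (the device path and
iteration count are placeholders; error handling omitted):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        enum { SZ = 256 * 512 };  /* exactly 256 sectors */
        unsigned char *buf;
        int fd = open("/dev/sdX", O_RDWR | O_DIRECT);  /* placeholder */
        int i;

        posix_memalign((void **)&buf, 4096, SZ);
        for (i = 0; i < 1000; i++) {
            memset(buf, i & 0xff, SZ);
            pwrite(fd, buf, SZ, 0);  /* rewrite the same 256 sectors */
            fsync(fd);
            memset(buf, 0, SZ);
            pread(fd, buf, SZ, 0);
            if (buf[0] != (i & 0xff))
                return 1;            /* the write was discarded */
        }
        return 0;
    }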

Mikulas

Alan Cox

unread,
Nov 5, 2006, 1:10:54 PM11/5/06
to Mikulas Patocka
Ar Sul, 2006-11-05 am 18:18 +0100, ysgrifennodd Mikulas Patocka:
> Should IDE driver read back parameters after writing them before issuing
> the command? That should fix this problem. (except when command
> is written badly)

For "normal" usage I suspect not - the lack of ECC memory is probably
more serious as is the lack of PCI parity checking.

Alan Cox

unread,
Nov 5, 2006, 2:10:58 PM11/5/06
to Mikulas Patocka
Ar Sul, 2006-11-05 am 19:18 +0100, ysgrifennodd Mikulas Patocka:
> I meant for the corruption on IDE cable ... I have 5 UDMA CRC errors over
> 8 years of usage, so I can imagine someone could have command parameters
> corrupted too --- or are they transmitted in a more reliable way?

Much much slower and much less data per transfer.

H. Peter Anvin

unread,
Nov 5, 2006, 3:17:20 PM11/5/06
to Maurizio Lombardi
Maurizio Lombardi wrote:
> On 11/4/06, Mikulas Patocka <mik...@artax.karlin.mff.cuni.cz> wrote:
>
>> free space is organized in lists of free runs
>> and converted to bitmap only in case of
>> extreme fragmentation.
>
> Is there a performance reason to prefer lists of free blocks rather than
> a bitmap?
>
> I read in [Tanenbaum: Operating Systems: Design and Implementation, 2nd
> ed.] that lists are better than bitmaps only when the disk is almost full.
>

Yes, if you have a truly random access medium.

If you have media like physical disks, where fragmentation costs you,
the lists will kill you dead in no time at all.

-hpa

H. Peter Anvin

unread,
Nov 5, 2006, 3:26:59 PM11/5/06
to Kyle Moffett
Kyle Moffett wrote:
> On Nov 04, 2006, at 18:38:11, Mikulas Patocka wrote:
>> But how will fdisk deal with it? Fdisk by default aligns partitions on a
>> 63-sector boundary, so it will make all sectors misaligned and seriously
>> kill performance even if the filesystem uses proper 8-sector-aligned
>> accesses.
>
> Don't use a partition-table format that dates back to drives with actual
> reported physical geometry and which also maxed out at 2MB or so? Even
> the mac-format partition tables (which aren't that much newer) don't
> care about physical drive geometry.
>
> Besides, unless you're running DOS, Windows 95, or some random ancient
> firmware that looks at your partition tables or whatever you can just
> tell fdisk to ignore the 63-sector-alignment constraint and align your
> partitions more efficiently anyways. But if you're dealing with
> hardware so new it supports 4k or 8k sectors, you really should be using
> EFI or something.
>
> Cheers,
> Kyle Moffett
>

Actually, DOS/Win9x should handle arbitrary alignment just fine (except
possibly some very very old versions of DOS which assumed that the first
four sectors of IO.SYS all fell within the same track -- but I'm pretty
sure that the FORMAT and SYS programs would align it for you.)

-hpa

Rene Herman

unread,
Nov 5, 2006, 4:28:58 PM11/5/06
to H. Peter Anvin
H. Peter Anvin wrote:

[ partitions ]

> Actually, DOS/Win9x should handle arbitrary alignment just fine

For primary (and extended) partitions, yes. I haven't used any version
of DOS that has ever objected to arbitrarily aligned partitions in the
MBR (and I do align them arbitrarily since I always make my partitions
some exact size and start the next partition in the next sector).

Different though for logical partitions inside an extended. As late as
Windows 98, DOS would object to non-aligned logicals, at the very least
with some settings for the BIOS use/don't use LBA or "Large" settings.

Linux doesn't care; I've used type 0x85 instead of 0x05 for my extended
partitions due to that for years. DOS just ignores that one...

Rene

Pavel Machek

unread,
Nov 5, 2006, 4:50:15 PM11/5/06
to Mikulas Patocka
Hi!

> >>>As my PhD thesis, I am designing and writing a filesystem,
> >>>and it's now in a state that it can be released. You can
> >>>download it from http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
> >>
> >>"Disk that can atomically write one sector (512 bytes) so that
> >>the sector contains either old or new content in case of crash."
> >
> >New drives will soon use 4096-byte sectors. This is a better
> >match for the normal (non-VAX!) page size and reduces overhead.
>
> The drive (IDE model; SCSI can have a larger sector size) will do
> read-modify-write for smaller writes. So there should be no compatibility
> issues. (this possibility is in the new ATA standard and there is a way to

Actually, there are. If you assume a powerfail can only destroy 512
bytes... a read-modify-write of 4K is going to destroy your "only 512
bytes destroyed" assumption...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

H. Peter Anvin

unread,
Nov 5, 2006, 4:52:03 PM11/5/06
to Rene Herman
Rene Herman wrote:
> H. Peter Anvin wrote:
>
> [ partitions ]
>
>> Actually, DOS/Win9x should handle arbitrary alignment just fine
>
> For primary (and extended) partitions, yes. I haven't used any version
> of DOS that has ever objected to arbitrarily aligned partitions in the
> MBR (and I do align them arbitrarily since I always make my partitions
> some exact size and start the next partition in the next sector).
>
> Different though for logical partitions inside an extended. As late as
> Windows 98, DOS would object to non-aligned logicals, at the very least
> with some settings for the BIOS use/don't use LBA or "Large" settings.
>
> Linux doesn't care; I've used type 0x85 instead of 0x05 for my extended
> partitions due to that for years. DOS just ignores that one...
>

DOS, or FDISK?

-hpa

Paul E. McKenney

unread,
Nov 5, 2006, 5:54:22 PM11/5/06
to Eric Dumazet
On Sat, Nov 04, 2006 at 07:27:48PM +0100, Eric Dumazet wrote:
> Gautham R Shenoy a écrit :
> >On Thu, Nov 02, 2006 at 10:52:47PM +0100, Mikulas Patocka wrote:
> >>Hi
> >
> >Hi Mikulas

> >>As my PhD thesis, I am designing and writing a filesystem, and it's now
> >>in a state that it can be released. You can download it from
> >>http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
> >>
> >>It has some new features, such as keeping inode information directly in
> >>directory (until you create hardlink) so that ls -la doesn't seek much,
> >>new method to keep data consistent in case of crashes (instead of
> >>journaling), free space is organized in lists of free runs and converted
> >>to bitmap only in case of extreme fragmentation.
> >>
> >>It is not very widely tested, so if you want, test it.
> >>
> >>I have these questions:
> >>
> >>* There is a rw semaphore that is locked for read for nearly all
> >>operations and locked for write only rarely. However locking for read
> >>causes cache line pingpong on SMP systems. Do you have an idea how to
> >>make it better?
> >>
> >>It could be improved by making a semaphore for each CPU and locking for
> >>read only the CPU's semaphore and for write all semaphores. Or is there a
> >>better method?
> >
> >I am currently experimenting with a light-weight reader-writer semaphore
> >with the objective of doing away with what you call the reader-side cache
> >line "ping pong". It achieves this by using a per-cpu refcount.
> >
> >A drawback of this approach, as Eric Dumazet mentioned elsewhere in this
> >thread, would be that each instance of the rw_semaphore would require
> >(NR_CPUS * sizeof(int)) bytes worth of memory in order to keep track of
> >the per-cpu refcount, which can prove to be pretty costly if this
> >rw_semaphore is for something like inode->i_alloc_sem.
>
> We might use a hybrid approach: use a percpu counter if NR_CPUS <= 8
>
> #define refcount_addr(zone, cpu) zone[cpu]
>
> For larger setups, have a fixed limit of 8 counters, and use a modulo
>
> #define refcount_addr(zone, cpu) zone[cpu & 7]
>
> In order not to use too much memory, we could use a kind of vmalloc() space,
> using one PAGE per cpu, so that addr(cpu) = base + (cpu)*PAGE_SIZE;
> (vmalloc space allows a NUMA allocation if possible)

The fact that the counters are shared forces the use of atomic instructions.

If the situation is highly read-intensive, another memory-saving
approach would be to share the "lock" among multiple inodes, for
example, hashing the inode address. That way there would be NR_CPUS
counters per hash bucket, but (hopefully) far fewer hash buckets
than inodes.
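
For instance, something along these lines (the bucket count and names are
made up):

    #define LOCK_BUCKETS 64  /* arbitrary; tune against contention */
    static struct rw_semaphore shared_sem[LOCK_BUCKETS];

    static inline struct rw_semaphore *inode_sem(struct inode *inode)
    {
        unsigned long h = (unsigned long)inode / sizeof(struct inode);
        return &shared_sem[h % LOCK_BUCKETS];
    }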

Thanx, Paul

> So instead of storing in an object a table of 8 pointers, we store only the
> address for cpu0.
>
>
> >
> >So the question I am interested in is, how many *live* instances of this
> >rw_semaphore are you expecting to have at any given time?
> >If this number is a constant (and/or not very big!), the light-weight
> >reader writer semaphore might be useful.
> >
> >Regards
> >Gautham.

Rene Herman

unread,
Nov 5, 2006, 7:37:22 PM11/5/06
to H. Peter Anvin
H. Peter Anvin wrote:

> Rene Herman wrote:

>> For primary (and extended) partitions, yes. I haven't used any version
>> of DOS that has ever objected to arbitrarily aligned partitions in the
>> MBR (and I do align them arbitrarily since I always make my partitions
>> some exact size and start the next partition in the next sector).
>>
>> Different though for logical partitions inside an extended. As late as
>> Windows 98, DOS would object to non-aligned logicals, at the very
>> least with some settings for the BIOS use/don't use LBA or "Large"
>> settings.
>>
>> Linux doesn't care; I've used type 0x85 instead of 0x05 for my
>> extended partitions due to that for years. DOS just ignores that one...
>>
>
> DOS, or FDISK?

DOS. It was something like DOS accepting the non cylinder-aligned
logical but then proceeding as if it were cylinder-aligned anyway,
rounding the starting sector down. This obviously is not good.

Also see the "--DOS-extended" comment in the sfdisk man page. Since I do
remember differences when using different "Large" CHSs in the BIOS, for
all I remember I was experiencing that problem when I ran into it. If
so, the DOS underlying Windows 98 is among the "some versions".

In any case, yes, non-cylinder aligned logical partitions (for whichever
definition of "aligned" fits DOS' idea of the geometry) really do cause
trouble.

The DR-DOS (-> Novell DOS -> Caldera OpenDOS) warning in that manpage
seems to imply that cylinder alignment was a good idea for all
partitions seen by it, and I do remember it being a pain in that regard
as well. Guess it's pretty safe to not care about DR-DOS anymore though.

Rene.

Phillip Susi

unread,
Nov 5, 2006, 9:44:28 PM11/5/06
to Mikulas Patocka
Mikulas Patocka wrote:
> There was discussion about it here some time ago, and I think the
> result was that the IDE bus is reset before the capacitors discharge and
> power is fully lost, and the disk has enough time to finish a sector --- but
> if you have crap power supply (doesn't signal power loss), crap
> motherboard (doesn't reset bus) or crap disk (doesn't respond to reset),
> it can fail.
>
> BTW. reiserfs and xfs depend on this feature too. ext3 is the only one
> that doesn't.
>
> Mikulas

Yes, if your disk cannot atomically commit a single sector, then it is
broken. And ALL filesystems rely on this behavior because they all
expect NOT to have hardware IO read failures of important metadata after
a power failure (due to the sector ECC failing).

Al Boldi

unread,
Nov 6, 2006, 12:38:26 PM11/6/06
to linux-...@vger.kernel.org
Albert Cahalan wrote:
> On 11/4/06, Jan-Benedict Glaw <jbg...@lug-owl.de> wrote:
> > Albert Cahalan <acah...@gmail.com> wrote:
> > > BTW, a person with disk recovery experience told me that drives
> > > will sometimes reorder the sectors. Sector 42 becomes sector 7732,
> > > sector 880880 becomes sector 12345, etc. The very best filesystems
> > > can handle this without data loss (for example, ZFS). Merely great
> > > filesystems will at least recognize that the data has been trashed.
> >
> > Uh? This should be transparent to the host computer, so logical sector
> > numbers won't change.
>
> "should be" does not imply "won't" :-)
>
> On a drive which is capable of remapping sectors, imagine what
> happens if the remapping data itself is corrupted. (the user data
> is perfectly fine and is not being relocated)

I would consider this a defective drive.

> What I mean is that the logical sector numbers not only change,
> but they are the only thing changing. The user data never moves
> to a different physical location, and is never intended to move.
> The user data is perfectly readable. It just appears in the wrong
> place as viewed by the OS.

Just like defective drive electronics; the data is ok, but the electronics
corrupts the I/O.

No FS could help you there, AFAIK.

BTW, why is this thread not on fsdevel?


Thanks!

--
Al

Jörn Engel

unread,
Nov 6, 2006, 4:19:25 PM11/6/06
to Mikulas Patocka
On Sat, 4 November 2006 19:50:01 +0100, Mikulas Patocka wrote:
>
> LFS fragments data by design ... it can't write to already allocated
> space, so if you write to the middle of an LFS directory, it will allocate
> a new block, new indirect pointers for that directory, a new block in the
> inode table, etc.

Based on the assumption that reads don't matter, which proved wrong.
Yes.

Your allocation strategy sounds fairly good. I'm not a benchmark
person, so I can only tell horrible from decent, not decent from good.
Your benchmarks speak for themselves, so I guess it is more than just
decent.

However, going back to crash counts and transaction counts, I still
don't understand why you don't just have two checkpoints (or any other
objects, if you don't like the name) with a 64bit version number each
that you alternately write for sync(). You seem to have that concept
for managing free space, just not to manage valid data. What is the
difference?
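
For concreteness, a sketch of that two-checkpoint scheme (the layout and
names are invented here, not taken from sprite lfs or spadfs):

    /* Two fixed slots on disk; after a crash, the slot with the higher
     * version number that also passes its checksum is the valid one. */
    struct checkpoint {
        __u64 version;     /* incremented on every sync */
        __u64 root_block;  /* root of the filesystem tree at that sync */
        __u32 checksum;    /* over this whole structure */
    };
    /* sync() writes the checkpoint to slot[version & 1], then bumps
     * version. */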

Jörn

--
Courage is not the absence of fear, but rather the judgement that
something else is more important than fear.
-- Ambrose Redmoon
