Journaling UFS with gjournal.


Pawel Jakub Dawidek

Jun 19, 2006, 9:14:58 AM
to freebsd...@freebsd.org, freeb...@freebsd.org, freebs...@freebsd.org
Hello.

For the last few months I have been working on the gjournal project.
To avoid confusion right away, I want to note that this project is not
related to the gjournal project that Ivan Voras worked on during the
last SoC (2005).

The lack of a journaled file system has been FreeBSD's Achilles' heel
for many years. We do have many file systems, but none with journaling:
- ext2fs (journaling is in ext3fs),
- XFS (read-only),
- ReiserFS (read-only),
- HFS+ (read-write, but without journaling),
- NTFS (read-only).

GJournal was designed to journal GEOM providers, so it actually works
below the file system layer, but it has hooks which allow it to
cooperate with file systems. In other words, gjournal is not file
system-dependent; it can probably work with any file system with
minimal knowledge about it. I have implemented only UFS support.

The patches are here:

http://people.freebsd.org/~pjd/patches/gjournal.patch (for HEAD)
http://people.freebsd.org/~pjd/patches/gjournal6.patch (for RELENG_6)

To patch your sources you need to:

# cd /usr/src
# mkdir sbin/geom/class/journal sys/geom/journal sys/modules/geom/geom_journal
# patch < /path/to/gjournal.patch

Add 'options UFS_GJOURNAL' to your kernel configuration file and
recompile the kernel and world.
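
For example, with a custom kernel configuration named MYKERNEL (the
name is only a placeholder), the usual rebuild procedure would look
roughly like this (see the handbook for the full procedure):

# cd /usr/src
# make buildworld
# make buildkernel KERNCONF=MYKERNEL
# make installkernel KERNCONF=MYKERNEL
# make installworld
# shutdown -r now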

How it works (in short): you may define one or two providers for
gjournal to use. If one provider is given, it will be used for both
data and the journal. If two providers are given, one will be used for
data and one for the journal.
Every few seconds (the interval is configurable) the journal is
terminated and marked as consistent, and gjournal starts to copy data
from it to the data provider. At the same time, new data is stored in
a new journal.
Let's call the moment at which the journal is terminated the "journal
switch". A journal switch looks as follows:
1. Start a journal switch when the timeout fires or when we run out of
cache. Don't perform a journal switch if there were no write requests.
2. If we have a file system, synchronize it.
3. Mark the file system as clean.
4. Block all write requests to the file system.
5. Terminate the journal.
6. If necessary, wait until copying of the previous journal has
finished.
7. Send BIO_FLUSH request (if the given provider supports it).
8. Mark new journal position on the journal provider.
9. Unblock write requests.
10. Start copying data from the terminated journal to the data provider.

There were a few things I needed to implement outside gjournal to make
it work reliably:

- The BIO_FLUSH request. Currently we have three I/O requests: BIO_READ,
BIO_WRITE and BIO_DELETE. I added BIO_FLUSH, which means "flush your
write cache". The request is always sent with the largest possible
bio_offset (the mediasize of the destination provider), so it works
properly with bioq_disksort(). The caller needs to hold off further I/O
requests until BIO_FLUSH returns, so there is no starvation effect.
The hard part is that it has to be implemented in every disk driver,
because flushing the cache is a driver-dependent operation. I
implemented it for ata(4) disks and amr(4). The good news is that it's
easy.
GJournal can also work with providers that don't support BIO_FLUSH, and
in my power-failure tests it worked well (no problems), but that depends
on the gjournal cache being bigger than the controller cache, so it is
hard to call it reliable.
The documentation of many journaled file systems tells you to turn off
the write cache if you want to use them. This is not the case for
gjournal (especially when your disk driver supports BIO_FLUSH).

- The 'gjournal' mount option. To implement gjournal support in UFS I
needed to change the way deleted, but still open, objects are handled.
Currently, when a file or directory is open and the last name
referencing it is deleted, it remains usable by those who keep it open.
When the last consumer closes it, the inode and blocks are freed.
At a journal switch I cannot leave such objects around, because after a
crash fsck(8) is no longer used to check the file system, so the inode
and blocks would never be freed. When a file system is mounted with the
'gjournal' mount option, such objects are not removed while they are
open. When the last name is deleted, the file/directory is moved to the
.deleted/ directory and removed from there on last close.
This way, I can simply clean the .deleted/ directory at mount time
after a crash.

Quick start:

# gjournal label /dev/ad0
# gjournal load
# newfs /dev/ad0.journal
# mount -o async,gjournal /dev/ad0.journal /mnt
(yes, with gjournal 'async' is safe)

Now, after a power failure or system crash no fsck is needed (yay!).
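
If you want to inspect the result, the generic geom(8) verbs should
work for the journal class too (a sketch; the exact output format may
differ):

# gjournal list
# gjournal status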

There are two hacks in the current implementation which I'd like to
reimplement. The first is how the 'gjournal' mount option is
implemented. There is a garbage collector thread which is responsible
for deleting objects from the .deleted/ directory, and it uses full
paths. Because of this, when your mount point is /foo/bar/baz and you
rename 'bar' to something else, it will not work. This is not something
that is done often, but it definitely should be fixed and I'm working
on it. The second hack is related to the communication between gjournal
and the file system. GJournal decides when to make the switch and has
to find the file system which is mounted on it. The way this lookup is
done is not nice and should be reimplemented.

There are some additional benefits which come with gjournal. For
example, if gjournal is configured over gmirror or graid3, there is no
need to synchronize the mirror/raid3 device after a power failure or
system crash, because the data will be consistent.
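
A possible way to stack gjournal on top of gmirror, following the quick
start above (a sketch; the device names da0/da1 and the label name
'data' are illustrative):

# gmirror label -v data da0 da1
# gjournal label /dev/mirror/data
# newfs /dev/mirror/data.journal
# mount -o async,gjournal /dev/mirror/data.journal /mnt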

I spent a lot of time working on gjournal optimization. Because I have
a few seconds before the data hits the data provider, I can perform
optimizations like combining smaller write requests into larger ones,
ignoring data written twice to the same place, etc.
Because of this, operations on small files are quite fast. On the other
hand, operations on large files are slower, because the data has to be
written twice and there is no room for optimization. Here are some
numbers:
gjournal(1) - the data provider and the journal provider on the same disk
gjournal(2) - the data provider and the journal provider on separate
disks

Copying one large file:
UFS: 8s
UFS+SU: 8s
gjournal(1): 16s
gjournal(2): 14s

Copying eight large files in parallel:
UFS: 120s
UFS+SU: 120s
gjournal(1): 184s
gjournal(2): 165s

Untarring eight src.tgz in parallel:
UFS: 791s
UFS+SU: 650s
gjournal(1): 333s
gjournal(2): 309s

Reading. grep -r on two src/ directories in parallel:
UFS: 84s
UFS+SU: 138s
gjournal(1): 102s
gjournal(2): 89s

As you can see, even on one disk, untarring eight src.tgz files is two
times faster than on UFS+SU. I have no idea why gjournal is faster at
reading.

There are a bunch of sysctls to tune gjournal (kern.geom.journal tree).
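
For example, to browse and adjust them (parallel_copies is one of the
tunables in that tree; the value shown is illustrative):

# sysctl kern.geom.journal
# sysctl kern.geom.journal.parallel_copies=16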

When only one provider is given for both data and journal, the journal
part is placed at the end of the provider, so the file system itself
can still be used without journaling. If you use such a configuration
(one disk), it is better for performance to place the journal before
the data, so you may want to create two partitions (e.g. 2GB for ad0a
and the rest for ad0d) and create the gjournal this way:

# gjournal label ad0d ad0a
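
Then proceed as in the quick start; the new provider is named after the
data provider (a sketch):

# newfs /dev/ad0d.journal
# mount -o async,gjournal /dev/ad0d.journal /mnt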

Enjoy!

The work was sponsored by home.pl (http://home.pl).

The work was done by Wheel LTD (http://www.wheel.pl).
The work was tested in the netperf cluster.

I want to thank Alexander Kabaev (kan@) for the help with VFS and
Mike Tancsa for test hardware.

--
Pawel Jakub Dawidek http://www.wheel.pl
p...@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Niki Denev

Jun 19, 2006, 2:33:36 PM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org

Wow, this looks pretty cool!

I wonder if it's possible to use gjournal on an existing file system
with the journal on a vnode- or swap-backed md(4) device?
(I want to test on an existing installation without free unpartitioned
space.)

And if it is possible, how can I do this for the root filesystem? I'll
need the md(4) device before the root fs is mounted, which seems
hard/impossible.
What's going to happen if my root mount is gjournal-labeled and has the
gjournal option in fstab, but at boot time the journal GEOM provider
does not exist?

Thanks for the great work!
When finished, this will certainly make FreeBSD much more competitive :)

--niki


Brooks Davis

Jun 19, 2006, 3:02:06 PM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Mon, Jun 19, 2006 at 03:11:01PM +0200, Pawel Jakub Dawidek wrote:
>
> How it works (in short). You may define one or two providers which
> gjournal will use. If one provider is given, it will be used for both -
> data and journal. If two providers are given, one will be used for data
> and one for journal.
> Every few seconds (you may define how many) journal is terminated and
> marked as consistent and gjournal starts to copy data from it to the
> data provider. In the same time new data are stored in new journal.
> Let's call the moment in which journal is terminated as "journal switch".

Cool solution! I think I'll give this a try on my redundant mirror
server at work. I'd be curious to see how gjournal performs with the
journal on a battery-backed RAM disk like the Gigabyte i-RAM:

http://www.giga-byte.com/Products/Storage/Products_Overview.aspx?ProductID=2180&ProductName=GC-RAMDISK

It seems like that could reduce or eliminate many of the performance
issues in practice.

-- Brooks

Craig Rodrigues

Jun 20, 2006, 2:37:01 AM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Mon, Jun 19, 2006 at 03:11:01PM +0200, Pawel Jakub Dawidek wrote:

I would recommend that you not introduce a new MNT_GJOURNAL flag to
<sys/mount.h>, and that instead you just pass -o gjournal directly down
into nmount(). In kernel code, you can use vfs_flagopt()/vfs_getopt()
to determine whether you have this mount option or not.
The mount(8) userland utility would not need any modifications,
since it just passes -o options down to nmount().

gjournal looks very interesting!
--
Craig Rodrigues
rod...@crodrigues.org

Pawel Jakub Dawidek

Jun 20, 2006, 4:40:11 AM
to Niki Denev, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Mon, Jun 19, 2006 at 09:32:18PM +0300, Niki Denev wrote:
+> I wonder if it's possible to use gjournal on
+> existing file system with the journal on a vnode/(swap?) backed md(4) device?
+> (i want to test on a existing installation without free unpartitioned space)

It depends on what you want to test. If you just want to look around, a
swap-backed md(4) device for the journal should be fine.
If you want to perform crash tests, you may want to turn off the swap
and use its provider for the journal directly (without md(4)), so it
will be available after a reboot.

You can configure gjournal on an existing file system, but, as always,
the last sector will be used for metadata.
For example, say you have your file system on ad0s1d and swap on
ad0s1b. You can try to configure gjournal this way:

# swapoff /dev/ad0s1b
# umount /dev/ad0s1d
# gjournal label ad0s1d ad0s1b

Your swap partition should be at least 2GB if your file system will be
heavily loaded. Be warned that this will overwrite the last sector of
ad0s1d, which should be safe, but you never know.
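
After that, the existing file system should be reachable through the
new provider. A minimal sketch of the remaining steps (assuming the
label step succeeded; fsck is only needed if the file system was not
cleanly unmounted):

# fsck -y /dev/ad0s1d.journal
# mount -o async,gjournal /dev/ad0s1d.journal /mnt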

+> And if it is possible, how can i do this for the root filesystem? i'll need the md(4)
+> device before mounting of the root fs which seems hard/impossible?
+> What's going to happen if my root mount is gjournal labeled and has gjournal option in
+> fstab but at boot time the journal GEOM provider does not exist?

I forgot to mention this in my initial mail: it is not yet possible to
use gjournal for the root file system.

Pawel Jakub Dawidek

Jun 20, 2006, 4:44:53 AM
to Brooks Davis, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Mon, Jun 19, 2006 at 11:58:00AM -0700, Brooks Davis wrote:
+> On Mon, Jun 19, 2006 at 03:11:01PM +0200, Pawel Jakub Dawidek wrote:
+> >
+> > How it works (in short). You may define one or two providers which
+> > gjournal will use. If one provider is given, it will be used for both -
+> > data and journal. If two providers are given, one will be used for data
+> > and one for journal.
+> > Every few seconds (you may define how many) journal is terminated and
+> > marked as consistent and gjournal starts to copy data from it to the
+> > data provider. In the same time new data are stored in new journal.
+> > Let's call the moment in which journal is terminated as "journal switch".
+>
+> Cool solution! I think I'll give this a try on my redundent mirror
+> server at work. I'd be curious to see how gjournal performs with the
+> journal on a battery backed ram disk like the gigabyte i-RAM:
+>
+> http://www.giga-byte.com/Products/Storage/Products_Overview.aspx?ProductID=2180&ProductName=GC-RAMDISK

I am curious too:) But as I said, there is still a lot of room for
performance improvements. The bottleneck currently is file system
synchronization, I think.
I hope our VFS gurus will look into VFS_SYNC() optimizations.

Alex Dupre

Jun 20, 2006, 5:09:54 AM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
Pawel Jakub Dawidek wrote:
> I forgot to mention this in my initial mail. This is not yet possible to
> use gjournal for the root file system.

Even if the machine boots from another device and the gjournal kernel
module is loaded before mounting the root filesystem?

--
Alex Dupre

Pawel Jakub Dawidek

Jun 20, 2006, 5:19:42 AM
to Alex Dupre, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Tue, Jun 20, 2006 at 11:05:31AM +0200, Alex Dupre wrote:
+> Pawel Jakub Dawidek wrote:
+> >I forgot to mention this in my initial mail. This is not yet possible to
+> >use gjournal for the root file system.
+>
+> Even if the machine boots from another device and the gjournal kernel module is loaded before mounting the root filesystem?

Yes, even then, because the mount(8) utility is responsible for
cleaning the .deleted/ directory. This could be done when the file
system is remounted read-write, but I just haven't had time to work on
it yet.

Phil Regnauld

Jun 20, 2006, 6:30:32 AM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Mon, Jun 19, 2006 at 03:11:01PM +0200, Pawel Jakub Dawidek wrote:
>
> Copying one large file:
> UFS: 8s
> UFS+SU: 8s
> gjournal(1): 16s
> gjournal(2): 14s

This is very, very interesting work!

I am definitely going to test this.

I know this is too early to ask, considering the optimizations that
can still be done, but do you have any idea how this would perform
compared to ReiserFS on operations similar to the ones you
benchmarked?

PS: Is it me, or is the patch missing the gjournal command invoked
in your examples?

Cheers,
Phil

Mike Jakubik

Jun 20, 2006, 3:21:31 PM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
Pawel Jakub Dawidek wrote:
> Copying one large file:
> UFS: 8s
> UFS+SU: 8s
> gjournal(1): 16s
> gjournal(2): 14s
>
> Copying eight large files in parallel:
> UFS: 120s
> UFS+SU: 120s
> gjournal(1): 184s
> gjournal(2): 165s
>
> Untaring eight src.tgz in parallel:
> UFS: 791s
> UFS+SU: 650s
> gjournal(1): 333s
> gjournal(2): 309s
>
> Reading. grep -r on two src/ directories in parallel:
> UFS: 84s
> UFS+SU: 138s
> gjournal(1): 102s
> gjournal(2): 89s
>

Not to sound ungrateful for the work (I do appreciate it; this is
great!), but the performance impact seems rather large to me. Does the
presence of journaling mean that we could perhaps mount the filesystems
async? Does it eliminate the need for softupdates?

Pawel Jakub Dawidek

Jun 20, 2006, 3:41:01 PM
to Mike Jakubik, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Tue, Jun 20, 2006 at 03:20:49PM -0400, Mike Jakubik wrote:
+> Pawel Jakub Dawidek wrote:
+> >Copying one large file:
+> >UFS: 8s
+> >UFS+SU: 8s
+> >gjournal(1): 16s
+> >gjournal(2): 14s
+> >
+> >Copying eight large files in parallel:
+> >UFS: 120s
+> >UFS+SU: 120s
+> >gjournal(1): 184s
+> >gjournal(2): 165s
+> >
+> >Untaring eight src.tgz in parallel:
+> >UFS: 791s
+> >UFS+SU: 650s
+> >gjournal(1): 333s
+> >gjournal(2): 309s
+> >
+> >Reading. grep -r on two src/ directories in parallel:
+> >UFS: 84s
+> >UFS+SU: 138s
+> >gjournal(1): 102s
+> >gjournal(2): 89s
+> >
+>
+> Not to sound ungrateful for the work, which i am, this is great! But the performance impact seems rather large to me. Does the presence of journaling mean that we could
+> perhaps mount the filesystems async? Does it eliminate the need for softupdates?

The performance impact is big for large files, because in theory we
have to write the data twice.
Yes, it eliminates the need for SU, but there are reasons why you might
still want to use SU, e.g. for snapshots.

Xin LI

Jun 20, 2006, 4:01:56 PM
to Pawel Jakub Dawidek, freeb...@freebsd.org, Mike Jakubik, freebsd...@freebsd.org, freebs...@freebsd.org
On Tue, 2006-06-20 at 21:36 +0200, Pawel Jakub Dawidek wrote:

> The performance impact is big for large files, because in theory we have
> to write the data twice.
> Yes, it eliminates need for SU, but there are reasons, that you still
> want to use SU, eg. for snapshots.

Em... IIRC SU and snapshots are independent, no?

Cheers,
--
Xin LI <delphij delphij net> http://www.delphij.net/


Mike Jakubik

Jun 20, 2006, 4:11:23 PM
to Xin LI, freeb...@freebsd.org, freebsd...@freebsd.org, Pawel Jakub Dawidek, freebs...@freebsd.org
Xin LI wrote:
> On Tue, 2006-06-20 at 21:36 +0200, Pawel Jakub Dawidek wrote:
>
>> The performance impact is big for large files, because in theory we
>> have to write the data twice.
>> Yes, it eliminates the need for SU, but there are reasons why you
>> might still want to use SU, e.g. for snapshots.
>>
>
> Em... IIRC SU and snapshots are independent, no?
>
> Cheers,
>

What about mounting the filesystem async, though? It was my
understanding that the Linux filesystems were much faster in benchmarks
because they were mounted async by default, with the presence of
journaling making this safe. Is that the case here too?

Bakul Shah

Jun 20, 2006, 4:30:25 PM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
This is great! We have sorely needed this for quite a while, what with
terabyte-size filesystems coming into common use.

> How it works (in short). You may define one or two providers which
> gjournal will use. If one provider is given, it will be used for both -
> data and journal. If two providers are given, one will be used for data
> and one for journal.
> Every few seconds (you may define how many) journal is terminated and
> marked as consistent and gjournal starts to copy data from it to the
> data provider. In the same time new data are stored in new journal.

Some random comments:

Would it make sense to treat the journal as a circular
buffer? Then commit to the underlying provider starts when
the buffer has $hiwater blocks or the upper layer wants to
sync. The commit stops when the buffer has $lowater blocks
or in case of sync the buffer is empty. This will allow
parallel writes to the provider and the journal, thereby
reducing latency.

I don't understand why you need FS synchronization. Once the
journal is written, the data is safe. A "redo" may be needed
after a crash to sync the filesystem but that is about it.
Redo should be idempotent. Each journal write block may need
some flags. For instance mark a block as a "sync point" --
when this block is on the disk, the FS will be in a
consistent state. In case of redo after crash you have to
throw away all the journal blocks after the last sync point.

It seems to me that if you write a serial number with each data block,
in the worst case redo has to do a binary search to find the first
block to write, but normal writes to the journal and reads from the
journal (for committing to the provider) can be completely sequential.
Since redo will be much, much faster than fsck, you can afford to slow
it down a bit if the normal case can be sped up.

Presumably you disallow opening any file in /.deleted.

Can you gjournal the journal disk? Recursion is good:-)

-- bakul

Scott Long

Jun 20, 2006, 4:33:42 PM
to Mike Jakubik, freeb...@freebsd.org, Pawel Jakub Dawidek, freebsd...@freebsd.org, Xin LI, freebs...@freebsd.org
Mike Jakubik wrote:
> Xin LI wrote:
>
>> On Tue, 2006-06-20 at 21:36 +0200, Pawel Jakub Dawidek wrote:
>>
>>> The performance impact is big for large files, because in theory we
>>> have to write the data twice.
>>> Yes, it eliminates the need for SU, but there are reasons why you
>>> might still want to use SU, e.g. for snapshots.
>>>
>>
>> Em... IIRC SU and snapshots are independent, no?
>>
>> Cheers,
>>
>
> What about mounting the filesystem async, though? It was my
> understanding that the Linux filesystems were much faster in
> benchmarks because they were mounted async by default, with the
> presence of journaling making this safe. Is that the case here too?
>

Yes, async mounting is much faster than sync mounting, and slightly
faster than SU, except when SU is dealing with huge data sets. Then
async is significantly faster.

Scott

Ulrich Spoerlein

Jun 20, 2006, 4:51:39 PM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
Pawel Jakub Dawidek wrote:
> Hello.
>
> For the last few months I have been working on gjournal project.

Cool Stuff!

> Reading. grep -r on two src/ directories in parallel:
> UFS: 84s
> UFS+SU: 138s
> gjournal(1): 102s
> gjournal(2): 89s
>
> As you can see, even on one disk, untaring eight src.tgz is two times
> faster than UFS+SU. I've no idea why gjournal is faster in reading.

The UFS+SU score doesn't seem right. Why does SU have a negative impact
on read performance? Is it solely because of the atime updates?

Ulrich Spoerlein
--
PGP Key ID: 20FEE9DD Encrypted mail welcome!
Fingerprint: AEC9 AF5E 01AC 4EE1 8F70 6CBD E76E 2227 20FE E9DD
Which is worse: ignorance or apathy?
Don't know. Don't care.

Pawel Jakub Dawidek

Jun 20, 2006, 4:53:46 PM
to Phil Regnauld, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Tue, Jun 20, 2006 at 12:29:10PM +0200, Phil Regnauld wrote:
+> On Mon, Jun 19, 2006 at 03:11:01PM +0200, Pawel Jakub Dawidek wrote:
+> >
+> > Copying one large file:
+> > UFS: 8s
+> > UFS+SU: 8s
+> > gjournal(1): 16s
+> > gjournal(2): 14s
+>
+> This is very very interesting work!
+>
+> I am definitely going to test this.
+>
+> I know this is too early to ask considering the optimizations
+> that can be done, but do you have any idea how this would perform
+> compared to ReiserFS on similar operations as the ones you
+> benchmarked ?

No idea. I think ReiserFS uses only metadata journaling, but I don't
know for sure.

+> PS: is it me or is the patch missing a gjournal command, as invoked
+> in your examples ?

It is you :) gjournal(8) is implemented as a shared library for the
geom(8) command, just like gconcat(8), gstripe(8), gmirror(8),
graid3(8), geli(8), gnop(8), glabel(8) and gshsec(8).
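
In other words, gjournal(8) is just a shorthand for the corresponding
geom(8) invocation; the two forms below should be equivalent (a sketch):

# geom journal label /dev/ad0
# gjournal label /dev/ad0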

Bruce Evans

Jun 21, 2006, 12:06:31 PM
to Ulrich Spoerlein, freeb...@freebsd.org, freebsd...@freebsd.org, Pawel Jakub Dawidek, freebs...@freebsd.org
On Tue, 20 Jun 2006, Ulrich Spoerlein wrote:

> Pawel Jakub Dawidek wrote:
>> Hello.
>>
>> For the last few months I have been working on gjournal project.
>
> Cool Stuff!
>
>> Reading. grep -r on two src/ directories in parallel:
>> UFS: 84s
>> UFS+SU: 138s
>> gjournal(1): 102s
>> gjournal(2): 89s
>>
>> As you can see, even on one disk, untaring eight src.tgz is two times
>> faster than UFS+SU. I've no idea why gjournal is faster in reading.
>
> The UFS+SU score doesn't seem right. Why do SU have a negative impact on
> read performance? Is it solely because of the atime updates?

ffs+SU is only 1-10% slower than ffs in my benchmarks of reading back
a copy of most of src/ written to a new file system by the same
filesystem (code) that does the readback. The speed only depends on
which file system wrote the data. I use tar for reading. Maybe
concurrent greps on separate directories amplify the problem.

A tiny subset of saved benchmarked output:
%%%
Jan 29 2004
real-current writing to WD 1200JB h: 26683965 73593765
---
srcs = "contrib crypto lib sys" in /usr/src
ffs-16384-02048-1:
tarcp /f srcs: 43.23 real 0.65 user 6.85 sys
tar cf /dev/zero srcs: 15.58 real 0.19 user 2.13 sys
ffs-16384-02048-2:
tarcp /f srcs: 41.26 real 0.50 user 7.06 sys
tar cf /dev/zero srcs: 15.80 real 0.25 user 2.10 sys
ffs-16384-02048-as-1:
tarcp /f srcs: 22.17 real 0.49 user 6.47 sys
tar cf /dev/zero srcs: 15.52 real 0.22 user 2.13 sys
ffs-16384-02048-as-2:
tarcp /f srcs: 21.67 real 0.45 user 6.61 sys
tar cf /dev/zero srcs: 15.65 real 0.19 user 2.16 sys
ffs-16384-02048-su-1:
tarcp /f srcs: 60.35 real 0.49 user 7.02 sys
tar cf /dev/zero srcs: 17.32 real 0.20 user 2.15 sys
ffs-16384-02048-su-2:
tarcp /f srcs: 61.82 real 0.50 user 7.14 sys
tar cf /dev/zero srcs: 17.56 real 0.21 user 2.17 sys
%%%

Notation: 16384-02048 is the block-frag size;
/""/as/su/ are /default/async mounts/soft updates/; -[12] is ffs[12].
The source tree is prefetched into VMIO so that the copy part of the
benchmark is mostly a write benchmark and is not affected by any slowness
in the physical source file system.

The above shows soft updates being about 2 seconds or 10% slower for
read-back. It also shows that soft updates is about 3 times as slow
as async mounts and about 1.5 times as slow as the default (sync
metadata and async data). Soft updates was faster than the default
when it was first implemented, but became slower at least for writing
a copy of src/. This seems to be due to soft updates interacting
badly with bufdaemon. This may be fixed now (I have later runs of the
benchmark showing soft updates having about the same speed as the
default, but none for -realcurrent).

I never found the exact cause of the slower readback. My theory is
that block allocation is more delayed in the soft updates case, and
soft updates uses this to perfectly pessimize some aspects of the
allocation. My version of ffs allocates the first indirect block
between the NDADDR-1'th and NDADDR'th data blocks. This seems to
help generally, and reduces the disadvantage of soft updates. IIRC,
the default normally puts this block not very far away but not
necessarily between the data blocks, but soft updates pessimizes it
by moving it a long way away. It's still surprising that this makes
nearly a 10% difference for src/, since most files in src/ are too
small to have even 1 indirect block.

I always disable atime updates on ffs file systems and don't have
comparative benchmarks for the difference from this.

Bruce

Pawel Jakub Dawidek

Jun 22, 2006, 5:51:51 AM
to freebsd...@freebsd.org, freeb...@freebsd.org, freebs...@freebsd.org
On Tue, Jun 20, 2006 at 07:33:39PM +0200, Ulrich Spoerlein wrote:
+> Pawel Jakub Dawidek wrote:
+> > Reading. grep -r on two src/ directories in parallel:
+> > UFS: 84s
+> > UFS+SU: 138s
+> > gjournal(1): 102s
+> > gjournal(2): 89s
+> >
+> > As you can see, even on one disk, untaring eight src.tgz is two times
+> > faster than UFS+SU. I've no idea why gjournal is faster in reading.
+>
+> The UFS+SU score doesn't seem right. Why do SU have a negative impact on
+> read performance? Is it solely because of the atime updates?

As I said, I have no idea. You may simply ignore my benchmarks and try
them on your own :)

Pawel Jakub Dawidek

Jun 22, 2006, 5:56:05 AM
to Xin LI, freeb...@freebsd.org, Mike Jakubik, freebsd...@freebsd.org, freebs...@freebsd.org
On Wed, Jun 21, 2006 at 03:59:46AM +0800, Xin LI wrote:
+> On Tue, 2006-06-20 at 21:36 +0200, Pawel Jakub Dawidek wrote:
+> > The performance impact is big for large files, because in theory we have
+> > to write the data twice.
+> > Yes, it eliminates need for SU, but there are reasons, that you still
+> > want to use SU, eg. for snapshots.
+>
+> Em... IIRC SU and snapshots are independent, no?

Oops. Yes, you are right. bgfsck depends on SU, not snapshots.

Pawel Jakub Dawidek

Jun 22, 2006, 6:10:07 AM
to Bakul Shah, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Tue, Jun 20, 2006 at 01:29:48PM -0700, Bakul Shah wrote:
+> This is great! We have sorely needed this for quite a while
+> what with terabyte size filesystems getting into common use.

+>
+> > How it works (in short). You may define one or two providers which
+> > gjournal will use. If one provider is given, it will be used for both -
+> > data and journal. If two providers are given, one will be used for data
+> > and one for journal.
+> > Every few seconds (you may define how many) journal is terminated and
+> > marked as consistent and gjournal starts to copy data from it to the
+> > data provider. In the same time new data are stored in new journal.
+>
+> Some random comments:
+>
+> Would it make sense to treat the journal as a circular
+> buffer? Then commit to the underlying provider starts when
+> the buffer has $hiwater blocks or the upper layer wants to
+> sync. The commit stops when the buffer has $lowater blocks
+> or in case of sync the buffer is empty. This will allow
+> parallel writes to the provider and the journal, thereby
+> reducing latency.

This is basically what is done now.
There are always two journals: active and inactive.
New data is written to the active journal. When journal switch time
arrives (a timeout occurs or the cache is full), the active journal is
terminated and a new active journal is started right after it.
The previously active journal becomes inactive and its data is copied
to the destination (data) provider in parallel with new requests, which
are stored in the active journal.
Writes are suspended only to synchronize the file system and terminate
the active journal. Copying data from the inactive journal is done in
parallel with normal operations.

+> I don't understand why you need FS synchronization. Once the
+> journal is written, the data is safe. [...]

Which data? When you, for example, delete a file, you need to perform
these operations:
- remove the name from a directory,
- mark the inode as free,
- mark the blocks as free.

Synchronizing the file system gives me certainty that all those
operations have reached gjournal, so I can safely mark the file system
as clean and terminate the journal.

Alexandr Kovalenko

Jun 23, 2006, 4:25:56 AM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
Hello, Pawel Jakub Dawidek!

On Mon, Jun 19, 2006 at 03:11:01PM +0200, you wrote:

> For the last few months I have been working on gjournal project.
> To stop confusion right here, I want to note, that this project is not
> related to gjournal project on which Ivan Voras was working on the
> last SoC (2005).

[dd]

> Quick start:
>
> # gjournal label /dev/ad0
> # gjournal load
> # newfs /dev/ad0.journal
> # mount -o async,gjournal /dev/ad0.journal /mnt
> (yes, with gjournal 'async' is safe)
>
> Now, after a power failure or system crash no fsck is needed (yay!).

Is it safe to do so on an existing filesystem (if I'm using a 2nd
partition for the journal)?

i.e.:

$ grep ad0s1f /etc/fstab
/dev/ad0s1f /usr ufs rw,noatime 2 2

$ grep ad0s1b /etc/fstab
#/dev/ad0s1b none swap sw 0 0

# gjournal label ad0s1f ad0s1b
# gjournal load
# fsck -y /dev/ad0s1f.journal
# sed -i -e 's|ad0s1f|ad0s1f.journal|' /etc/fstab
# sed -i -e 's|noatime|noatime,async,gjournal|' /etc/fstab
# mount /usr

--
NEVE-RIPE, will build world for food
Ukrainian FreeBSD User Group
http://uafug.org.ua/

Alexandr Kovalenko

Jun 23, 2006, 4:54:15 AM
to R. B. Riddick, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
Hello, R. B. Riddick!

On Fri, Jun 23, 2006 at 01:38:38AM -0700, you wrote:

> --- Alexandr Kovalenko <ne...@nevermind.kiev.ua> wrote:
> > Is it safe to do so on existing filesystem (if I'm using 2nd partition for
> > journal)?
> >

> Hmm...
>
> Depends:
> If your existing file system needs its last sector, then it wont work. If it
> does not need it, then it might work (although fsck does not check for a
> raw-device shrinkage - I think)...
>
> I say, can you make the size of ad0s1f one sector bigger with bsdlabel(8)
> without changing the start sector?
> I mean: Is there at least one free sector after ad0s1f?

Unfortunately - no :(

R. B. Riddick

Jun 23, 2006, 7:31:41 AM
to Alexandr Kovalenko, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
--- Alexandr Kovalenko <ne...@nevermind.kiev.ua> wrote:
> Is it safe to do so on existing filesystem (if I'm using 2nd partition
> for journal)?
>
Hmm...

Depends:
If your existing file system needs its last sector, then it won't work.
If it does not need it, then it might work (although fsck does not
check for raw-device shrinkage, I think)...

I say, can you make the size of ad0s1f one sector bigger with bsdlabel(8)
without changing the start sector?
I mean: Is there at least one free sector after ad0s1f?
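
A sketch of how one might check and, if there is room, grow the
partition (assuming ad0s1f is the last partition in the slice; the
offset must stay unchanged):

$ bsdlabel ad0s1
(compare the size and offset of f: with the slice size)
# bsdlabel -e ad0s1
(if a sector is free after f:, increase its size by one)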

-Arne


Eric Anderson

Jun 23, 2006, 11:23:08 AM
to Pawel Jakub Dawidek, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org

I'm not sure this is happening exactly the way you describe. On my
laptop, while rsyncing my /home partition to a newly created external
disk (400G), I see 20MB/s of writes to the journaled UFS2 device
(/dev/label/backup.journal) passing through to the journal device
(/dev/label/journal); then it switches to no writes to the journaled
UFS2 device (/dev/label/backup.journal) (my rsync pauses) while the
underlying data device (/dev/label/backup) writes at 20MB/s for about
3-10 seconds.

> Let's call the moment in which journal is terminated as "journal switch".
> Journal switch looks as follows:
> 1. Start journal switch if we have timeout or if we run out of cache.
> Don't perform journal switch if there were no write requests.
> 2. If we have file system, synchronize it.
> 3. Mark file system as clean.
> 4. Block all write requests to the file system.
> 5. Terminate the journal.
> 6. Eventually wait if copying of the previous journal is not yet
> finished.

Seems like this is the point where we are busy.

> 7. Send BIO_FLUSH request (if the given provider supports it).
> 8. Mark new journal position on the journal provider.
> 9. Unblock write requests.
> 10. Start copying data from the terminated journal to the data provider.

And it seems that step 10 is happening earlier on...

Is this all expected behaviour?

Thanks for the great work, and fantastic GEOM tools!

Eric


--
------------------------------------------------------------------------
Eric Anderson Sr. Systems Administrator Centaur Technology
Anything that works is better than anything that doesn't.
------------------------------------------------------------------------

Oliver Fromme

Jun 23, 2006, 11:57:19 AM
to freebsd...@freebsd.org, p...@freebsd.org
R. B. Riddick <arne_w...@yahoo.com> wrote:
> Alexandr Kovalenko <ne...@nevermind.kiev.ua> wrote:
> > Is it safe to do so on existing filesystem (if I'm using 2nd partition for
> > journal)?
>
> Depends:
> If your existing file system needs its last sector, then it wont work. If it
> does not need it, then it might work (although fsck does not check for a
> raw-device shrinkage - I think)...

It has no way to check it. If the last sector of the
partition happens to be part of file data, overwriting
it with gjournal meta data will lead to a corrupted
file, and fsck(8) has no way to notice that, of course.
If that sector happens to contain UFS meta data, fsck(8)
might detect the corruption and try to correct it, which
will destroy the gjournal meta data. I guess that both
cases are very, very bad. :-)

It's not difficult to check whether the last sector is in use or not.
Just repeat the newfs(8) invocation with the -N flag, so it prints out
the values without doing anything (you can even do this as a normal
user, not root). For example:

$ bsdlabel /dev/ad0s1 | grep a:
a: 488397105 0 4.2BSD 2048 16384 106 # (Cyl. 0 - 484520*)
$ newfs -N /dev/ad0s1a
Warning: Block size and bytes per inode restrict cylinders per group to 89.
Warning: 1744 sector(s) in last cylinder unallocated
/dev/ad0s1a: 488397104 sectors in 119238 cylinders of 1 tracks, 4096 sectors
238475.1MB in 1340 cyl groups (89 c/g, 178.00MB/g, 22528 i/g)

In that case, the last sector is not used by the file
system. (Of course, if you created the FS with special
options, e.g. different cylinder group size, you must
specify those options here, too, or you might get wrong
output.)

FreeBSD does have growfs(8), but unfortunately it still
doesn't have shrinkfs(8), which other operating systems
have (e.g. Solaris). It might be a nice project for a
junior FS hacker ... ;-)

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"... there are two ways of constructing a software design: One way
is to make it so simple that there are _obviously_ no deficiencies and
the other way is to make it so complicated that there are no _obvious_
deficiencies." -- C.A.R. Hoare, ACM Turing Award Lecture, 1980

Dmitry Pryanishnikov

Jun 23, 2006, 3:23:23 PM
to freebsd...@freebsd.org, p...@freebsd.org

Hello!

On Fri, 23 Jun 2006, Oliver Fromme wrote:
> > If your existing file system needs its last sector, then it wont work. If it
> > does not need it, then it might work (although fsck does not check for a
> > raw-device shrinkage - I think)...
>
> It has no way to check it. If the last sector of the
> partition happens to be part of file data, overwriting
> it with gjournal meta data will lead to a corrupted
> file, and fsck(8) has no way to notice that, of course.

It seems to me that badsect(8) is the way to go. Just try to declare
the last sector as bad. fsck will then (after marking and unmounting)
tell you whether this sector is used by another file (if so, you could
just copy the relevant data and delete the file, while keeping the
newly created BAD/nnnnn file covering the last sector). badsect+fsck
will do all the consistency checks for you.
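
A sketch of how that might look (the sector number is illustrative; see
badsect(8) for the exact procedure):

# mount /dev/ad0s1f /mnt
# mkdir /mnt/BAD
# badsect /mnt/BAD 488397104
# umount /mnt
# fsck /dev/ad0s1f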

Sincerely, Dmitry
--
Atlantis ISP, System Administrator
e-mail: dmi...@atlantis.dp.ua
nic-hdl: LYNX-RIPE

Pawel Jakub Dawidek

Jun 23, 2006, 3:43:02 PM
to Eric Anderson, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Fri, Jun 23, 2006 at 10:20:38AM -0500, Eric Anderson wrote:
+> Pawel Jakub Dawidek wrote:
+> >Hello.
+> >For the last few months I have been working on gjournal project.
+> >To stop confusion right here, I want to note, that this project is not
+> >related to gjournal project on which Ivan Voras was working on the
+> >last SoC (2005).
+> >The lack of journaled file system in FreeBSD was a tendon of achilles
+> >for many years. We do have many file systems, but none with journaling:
+> >- ext2fs (journaling is in ext3fs),
+> >- XFS (read-only),
+> >- ReiserFS (read-only),
+> >- HFS+ (read-write, but without journaling),
+> >- NTFS (read-only).
+> >GJournal was designed to journal GEOM providers, so it actually works
+> >below file system layer, but it has hooks which allow to work with
+> >file systems. In other words, gjournal is not file system-depended,
+> >it can work probably with any file system with minimum knowledge
+> >about it. I implemented only UFS support.
+> >The patches are here:
+> > http://people.freebsd.org/~pjd/patches/gjournal.patch (for HEAD)
+> > http://people.freebsd.org/~pjd/patches/gjournal6.patch (for RELENG_6)
+> >To patch your sources you need to:
+> > # cd /usr/src
+> > # mkdir sbin/geom/class/journal sys/geom/journal sys/modules/geom/geom_journal
+> > # patch < /path/to/gjournal.patch
+> >Add 'options UFS_GJOURNAL' to your kernel configuration file and
+> >recompile kernel and world.

+> >How it works (in short). You may define one or two providers which
+> >gjournal will use. If one provider is given, it will be used for both -
+> >data and journal. If two providers are given, one will be used for data
+> >and one for journal.
+> >Every few seconds (you may define how many) journal is terminated and
+> >marked as consistent and gjournal starts to copy data from it to the
+> >data provider. In the same time new data are stored in new journal.
+>
+> I'm not sure this is happening the way you describe exactly. On my laptop, while rsyncing my /home partition to a newly created external disk (400G), I see 20MB/s writing
+> to the journaled UFS2 device (/dev/label/backup.journal) passing through to the journal device (/dev/label/journal), then it switches to no writes to the journaled UFS2
+> device (/dev/label/backup.journal) (my rsync pauses) while the journaled device (/dev/label/backup) writes at 20MB/s for about 3-10 seconds.

When it is time for a journal switch, we cannot switch the journals if
we are still copying data from the inactive journal, so we have to
wait. You can tune this a bit using these two sysctls:

kern.geom.journal.parallel_flushes - Number of flush I/O requests sent
in parallel
kern.geom.journal.parallel_copies - Number of copy I/O requests sent
in parallel

By default those are equal; you may increase the second one or decrease
the first one to tell gjournal to focus more on copying the data from
the inactive journal, so that when journal switch time arrives, it
doesn't have to wait.
Before you do that, please consult the
kern.geom.journal.stats.wait_for_copy sysctl variable, which will tell
you how many times a journal switch was delayed because the inactive
journal was not yet fully copied.
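
For example (the first command reads the counter; the values in the
last two are illustrative):

# sysctl kern.geom.journal.stats.wait_for_copy
# sysctl kern.geom.journal.parallel_copies=16
# sysctl kern.geom.journal.parallel_flushes=8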

The additional waiting is because a lot of data is only in memory, and
when I call file system synchronization, all of that data goes to the
gjournal provider.

None of the modes in which UFS can operate is optimal for gjournal - I
mean sync, async and SU. The optimal mode for gjournal would be
something like: send the write request immediately and don't wait for
an answer. GJournal would take care of reordering write requests to get
optimal throughput, and this would allow for a more balanced load.
For example, SU sends write requests in peaks, which is bad for
gjournal.

+> >Let's call the moment in which journal is terminated as "journal switch".
+> >Journal switch looks as follows:
+> >1. Start journal switch if we have timeout or if we run out of cache.
+> > Don't perform journal switch if there were no write requests.
+> >2. If we have file system, synchronize it.
+> >3. Mark file system as clean.
+> >4. Block all write requests to the file system.
+> >5. Terminate the journal.
+> >6. Eventually wait if copying of the previous journal is not yet
+> > finished.
+>
+> Seems like this is the point we are busy in.
+>
+> >7. Send BIO_FLUSH request (if the given provider supports it).
+> >8. Mark new journal position on the journal provider.
+> >9. Unblock write requests.
+> >10. Start copying data from the terminated journal to the data provider.
+>
+> And it seems that 10 is happening earlier on..

Point number 10 actually happens after the journal switch: it is when
the active journal has been turned into an inactive journal and the
copying starts.

Don't take this order too strictly; I mainly wanted to show which steps
are performed.

Peter Jeremy

Jun 23, 2006, 3:50:27 PM
to R. B. Riddick, freeb...@freebsd.org, freebsd...@freebsd.org, freebs...@freebsd.org
On Fri, 2006-Jun-23 01:38:38 -0700, R. B. Riddick wrote:
>--- Alexandr Kovalenko <ne...@nevermind.kiev.ua> wrote:
>> Is it safe to do so on existing filesystem (if I'm using 2nd partition for
>> journal)?
...

>If your existing file system needs its last sector, then it wont work. If it
>does not need it, then it might work (although fsck does not check for a
>raw-device shrinkage - I think)...

In my experience, the last partition in a disk slice normally has an
odd number of sectors, and UFS definitely can't handle anything smaller
than a fragment (which defaults to 2K); I suspect that UFS can't handle
a trailing fragment either. In this case, the last sector is definitely
unused.

I may be wrong, but I don't think it's possible for the last sector of
a partition to be FS metadata, because the metadata is always at the
beginning of a CG and newfs won't create a CG unless there's some space
for data in it. If there is an integral number of fragments (or maybe
blocks), then marking the last fragment as 'bad' would seem an adequate
solution: the FS will ignore that block, but anything below the
filesystem won't see the "bad block" marker.

--
Peter Jeremy


Tarasov Alexey

Jun 26, 2006, 1:43:16 PM
to cur...@freebsd.org, Pawel Jakub Dawidek
Hello!

GJournal is very interesting. Are you going to make an installation
image (like the image with DTrace) to test it as the root filesystem?

Thanks.

Best regards,
Tarasov Alexey.
