
Re: ZFS committed to the FreeBSD base.


Kris Kennaway

Apr 5, 2007, 11:07:50 PM
On Fri, Apr 06, 2007 at 04:57:00AM +0200, Pawel Jakub Dawidek wrote:
> Hi.
>
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.
>
> Commit log:
>
> Please welcome ZFS - The last word in file systems.
>
> ZFS file system was ported from OpenSolaris operating system. The code
> is under the CDDL license.
>
> I'd like to thank all SUN developers that created this great piece of
> software.
>
> Supported by: Wheel LTD (http://www.wheel.pl/)
> Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
> Supported by: Sentex (http://www.sentex.net/)
>
> Limitations.
>
> Currently ZFS is only compiled as kernel module and is only available
> for i386 architecture. Amd64 should be available very soon, the other
> archs will come later, as we implement needed atomic operations.
>
> Missing functionality.
>
> - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
> iSCSI is also not supported at this point. This should be fixed in
> the future, we may also add support for sharing ZVOLs over ggate.
> - There is no support for ACLs and extended attributes.
> - There is no support for booting off of ZFS file system.
>
> Other than that, ZFS should be fully-functional.
>
> Enjoy!

Give yourself a pat on the back :)

Kris
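
For anyone who wants to kick the tires on the freshly committed bits, a
minimal first-use sketch (the pool name, device and mountpoint below are
placeholders, not taken from the announcement; per the limitations above, the
kernel module has to be loaded first):

# kldload zfs
# zpool create tank /dev/ad0s1d
# zfs create tank/home
# zfs set mountpoint=/export/home tank/home
# zpool status tank

The zpool(8)/zfs(8) split mirrors the OpenSolaris tools the port came from:
zpool manages the storage pool, zfs manages the datasets inside it.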

Sean Bryant

Apr 5, 2007, 11:21:08 PM
Is it fully 128-bit? I'm going by Wikipedia, which is by no means an
authoritative source, and I have no idea whether this was ever an issue.

Juha Saarinen

Apr 5, 2007, 11:42:51 PM
On 4/6/07, Kris Kennaway <kr...@obsecurity.org> wrote:
> > Please welcome ZFS - The last word in file systems.

> Give yourself a pat on the back :)

Seconded.


--
Juha
http://www.geekzone.co.nz/juha

Eric Anderson

Apr 6, 2007, 1:22:14 AM
On 04/05/07 21:57, Pawel Jakub Dawidek wrote:
> Hi.
>
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.


Pawel - you're a madman! :)

I'm afraid of what your next project will be.

Thanks for the solid work (again..),
Eric

Alex Dupre

Apr 6, 2007, 3:26:34 AM
Pawel Jakub Dawidek wrote:
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system.

Congratulations! You're great!

> - There is no support for booting off of ZFS file system.

Even when booting the kernel from removable UFS media and then mounting a ZFS
root via vfs.root.mountfrom?

--
Alex Dupre

Ivan Voras

Apr 6, 2007, 4:36:52 AM
Sean Bryant wrote:

> Is it fully 128bit? From wikipedia, which is by no means an
> authoritative source but I have no idea if this was ever an issue.

It's 64-bit even in Solaris. The "128-bitness" is only in the storage
format, not for file system ops visible to applications.

(AFAIK).


Robert Watson

Apr 6, 2007, 5:28:34 AM

On Fri, 6 Apr 2007, Alex Dupre wrote:

> Pawel Jakub Dawidek wrote:
>> I'm happy to inform that the ZFS file system is now part of the FreeBSD
>> operating system.
>
> Congratulations! You're great!
>
>> - There is no support for booting off of ZFS file system.
>
> Even booting kernel from a removable ufs media and then mounting a zfs root
> via vfs.root.mountfrom?

I believe the key issue here is that the boot loader doesn't yet support ZFS.
In 6.x and 7.x, the mechanism for mounting the root file system is the same
for all file systems, so it should be possible to use any file system as the
root file system as long as you can get the kernel up and running and, in the
case of ZFS, the ZFS module loaded (since it currently must be a module).

This is really exciting work and I'm very glad to see this in the tree!

Robert N M Watson
Computer Laboratory
University of Cambridge

Pawel Jakub Dawidek

Apr 6, 2007, 6:40:04 AM

That's correct. We are limited by POSIX, but the on-disk format is
128bit.

--
Pawel Jakub Dawidek http://www.wheel.pl
p...@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

Pawel Jakub Dawidek

Apr 6, 2007, 7:29:11 AM
On Fri, Apr 06, 2007 at 05:54:37AM +0100, Ricardo Correia wrote:
> I'm interested in the cross-platform portability of ZFS pools, so I have
> one question: did you implement the Solaris ZFS whole-disk support
> (specifically, the creation and recognition of the EFI/GPT label)?
>
> Unfortunately some tools in Linux (parted and cfdisk) have trouble
> recognizing the EFI partition created by ZFS/Solaris..

I'm not yet set up to move disks between FreeBSD and Solaris, but my
first goal was to integrate it with FreeBSD's GEOM framework.

We support cache flushing operations on any GEOM provider (disk,
partition, slice, anything disk-like), so basically I currently treat
everything as a whole disk (because I simply can), but I don't do any
EFI/GPT labeling. I'll try to move data from a Solaris disk to FreeBSD
and see what happens.
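
Since any GEOM provider can back a pool, as described above, a hedged
illustration (device and pool names are made up for the example):

# zpool create tank mirror ad4 ad6
# zpool create scratch /dev/mirror/gm0s1e

The first pool mirrors two whole disks; the second sits on a slice of an
existing gmirror, i.e. the "anything disk-like" case mentioned above.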

Pawel Jakub Dawidek

Apr 6, 2007, 7:55:57 AM
On Fri, Apr 06, 2007 at 05:28:34AM -0400, Robert Watson wrote:
>
> On Fri, 6 Apr 2007, Alex Dupre wrote:
>
> >Pawel Jakub Dawidek wrote:
> >>I'm happy to inform that the ZFS file system is now part of the FreeBSD
> >>operating system.
> >
> >Congratulations! You're great!
> >
> >> - There is no support for booting off of ZFS file system.
> >
> >Even booting kernel from a removable ufs media and then mounting a zfs root via vfs.root.mountfrom?
>
> I believe the key issue here is that the boot loader doesn't yet support ZFS. In 6.x and 7.x, the mechanism for mounting the root file system is identical to all other file
> systems, so it should be possible to use any file system as the root file system as long as you get can get the kernel up and running. And, in the case of ZFS, the ZFS
> module loaded (since it currently must be a module).

You are right in general, but it isn't really true for ZFS currently.
There are two very small issues:

1. The preferred way to mount a ZFS file system is via the 'zfs mount'
command, but it can be mounted the old way as well, so this really
shouldn't be an issue.

2. The ZFS kernel module reads the /etc/zfs/zpool.cache file on load by
accessing it via the file system. We would need to change it to load this
file via the loader. That shouldn't be hard.
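
For reference, once the zpool.cache issue above is solved, the setup Alex and
Robert are describing should boil down to a loader.conf fragment along these
lines (the pool name is a placeholder, and as noted this does not work yet):

zfs_load="YES"
vfs.root.mountfrom="zfs:tank"

with the kernel and /boot living on a small UFS partition that the loader can
still read.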

Pawel Jakub Dawidek

Apr 6, 2007, 8:34:47 AM
On Fri, Apr 06, 2007 at 01:29:11PM +0200, Pawel Jakub Dawidek wrote:
> On Fri, Apr 06, 2007 at 05:54:37AM +0100, Ricardo Correia wrote:
> > I'm interested in the cross-platform portability of ZFS pools, so I have
> > one question: did you implement the Solaris ZFS whole-disk support
> > (specifically, the creation and recognition of the EFI/GPT label)?
> >
> > Unfortunately some tools in Linux (parted and cfdisk) have trouble
> > recognizing the EFI partition created by ZFS/Solaris..
>
> I'm not yet set up to move disks between FreeBSD and Solaris, but my
> first goal was to integrate it with FreeBSD's GEOM framework.
>
> We support cache flushing operations on any GEOM provider (disk,
> partition, slice, anything disk-like), so basically I currently treat
> everything as a whole disk (because I simply can), but I don't do any
> EFI/GPT labeling. I'll try to move data from a Solaris disk to FreeBSD
> and see what happens.

First try:

GEOM: ad6: corrupt or invalid GPT detected.
GEOM: ad6: GPT rejected -- may not be recoverable.

:)

Roman Divacky

Apr 6, 2007, 11:17:53 AM
On Fri, Apr 06, 2007 at 04:57:00AM +0200, Pawel Jakub Dawidek wrote:
> Hi.

>
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.
>
> Commit log:

>
> Please welcome ZFS - The last word in file systems.

this is incredibly great! thnx..

do you have any benchmark numbers? I saw some in your *con paper
but since then we got new sx locks and you did some performance
improvements as well..

I am just curious :)

thnx again!

roman

Sean Bryant

Apr 6, 2007, 12:14:07 PM
Pawel Jakub Dawidek wrote:
> On Fri, Apr 06, 2007 at 10:36:52AM +0200, Ivan Voras wrote:
>
>> Sean Bryant wrote:
>>
>>
>>> Is it fully 128bit? From wikipedia, which is by no means an authoritative source but I have no idea if this was ever an issue.
>>>
>> It's 64-bit even in Solaris. The "128-bitness" is only in the storage format, not for file system ops visible to applications.
>>
>> (AFAIK).
>>
>
> That's correct. We are limited by POSIX, but the on-disk format is
> 128bit.
>
>
Thanks for the update,
I'll probably update that Wikipedia entry to reflect recent changes and
more correctly state the limitations.

Johan Hendriks

Apr 6, 2007, 2:37:34 PM

Great stuff.

Does it also need to be mentioned in /boot/defaults/loader.conf to load the zfs module?

regards,
Johan
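
For the impatient, loading the module by hand and at boot looks roughly like
this; note that the usual convention is to put overrides in /boot/loader.conf
rather than editing /boot/defaults/loader.conf:

# kldload zfs

and, to load it on every boot, in /boot/loader.conf:

zfs_load="YES"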

Bruce M. Simpson

Apr 6, 2007, 5:09:56 PM
This is most excellent work which is going to help everyone in a very
big way. Many thanks for working on this.

Ceri Davies

Apr 6, 2007, 5:07:06 PM
On Thu, Apr 05, 2007 at 09:58:47PM -0700, Rich Teer wrote:
> > I'm happy to inform that the ZFS file system is now part of the FreeBSD
> > operating system. ZFS is available in the HEAD branch and will be
> > available in FreeBSD 7.0-RELEASE as an experimental feature.
>
> This is fantastic news! At the risk of raking over ye olde arguments,
> as the old saying goes: "Dual licensing? We don't need no stinkeen
> dual licensing!". :-)

Actually, you might want to run that statement by a certain John Birrell
(j...@FreeBSD.org) regarding the DTrace port and see what answer you get.

Ceri
--
That must be wonderful! I don't understand it at all.
-- Moliere

Pawel Jakub Dawidek

Apr 6, 2007, 5:48:04 PM
On Fri, Apr 06, 2007 at 09:26:34AM +0200, Alex Dupre wrote:
> Pawel Jakub Dawidek wrote:
> > I'm happy to inform that the ZFS file system is now part of the FreeBSD
> > operating system.
>
> Congratulations! You're great!
>
> > - There is no support for booting off of ZFS file system.
>
> Even booting kernel from a removable ufs media and then mounting a zfs
> root via vfs.root.mountfrom?

I just verified that this will be possible:

# mount
tank on / (zfs, local)
devfs on /dev (devfs, local)

but I need some time to implement it right.

Gabor Kovesdan

Apr 6, 2007, 5:52:07 PM
Ceri Davies wrote:

> On Thu, Apr 05, 2007 at 09:58:47PM -0700, Rich Teer wrote:
>
>>> I'm happy to inform that the ZFS file system is now part of the FreeBSD
>>> operating system. ZFS is available in the HEAD branch and will be
>>> available in FreeBSD 7.0-RELEASE as an experimental feature.
>>>
>> This is fantastic news! At the risk of raking over ye olde arguments,
>> as the old saying goes: "Dual licensing? We don't need no stinkeen
>> dual licensing!". :-)
>>
>
> Actually, you might want to run that statement by a certain John Birrell
> (j...@FreeBSD.org) regarding the DTrace port and see what answer you get.
>
>
jhb@ is John Baldwin; John Birrell is jb@! :)

Regards,
Gabor

Bruno Damour

Apr 6, 2007, 7:06:26 PM
Thanks, fantastically interesting!

> Currently ZFS is only compiled as kernel module and is only available
> for i386 architecture. Amd64 should be available very soon, the other
> archs will come later, as we implement needed atomic operations.
>
I'm eagerly waiting for the amd64 version....

> Missing functionality.
>
> - There is no support for ACLs and extended attributes.
>
Is this planned? Does that mean I cannot use it as the basis for a
full-featured Samba share?

Thanks for your great work !!

Bruno DAMOUR

Pawel Jakub Dawidek

Apr 6, 2007, 8:57:05 PM
On Sat, Apr 07, 2007 at 12:39:14AM +0200, Bruno Damour wrote:
> Thanks, fantastically interesting!
> > Currently ZFS is only compiled as kernel module and is only available
> > for i386 architecture. Amd64 should be available very soon, the other
> > archs will come later, as we implement needed atomic operations.
> >
> I'm eagerly waiting for the amd64 version....
>
> >Missing functionality.
> >
> > - There is no support for ACLs and extended attributes.
> >
> Is this planned? Does that mean I cannot use it as the basis for a full-featured Samba share?

It is planned, but it's not trivial. Does samba support NFSv4-style
ACLs?

Bernd Walter

Apr 6, 2007, 10:56:45 PM
On Fri, Apr 06, 2007 at 04:57:00AM +0200, Pawel Jakub Dawidek wrote:
> Hi.
>
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.

I got a kmem panic just by copying a recent ports.tgz (36M) onto a ZFS
file system. My sandbox has only 128MB RAM, so kmem was set to ~40M.
After raising kmem to 80M it survived copying the file, but panicked
again while running tar -xvzf on the file into the same pool.
vfs.zfs.vdev.cache.size is unchanged at 10M.

--
B.Walter http://www.bwct.de http://www.fizon.de
be...@bwct.de in...@bwct.de sup...@fizon.de

Alex Dupre

Apr 7, 2007, 3:39:44 AM
Pawel Jakub Dawidek wrote:
> I just verified that this will be possible:
>
> # mount
> tank on / (zfs, local)
> devfs on /dev (devfs, local)
>
> but I need some time to implement it right.

I waited months for the current ZFS implementation; I can wait a bit longer
for root support now that I know it'll be possible :-) Thanks again.

Randall Stewart

Apr 7, 2007, 6:39:22 AM
Great work, Pawel...

I see you posted a quick start ... I will have
to move my laptop over to use this for its non-root file systems :-D

R

Pawel Jakub Dawidek wrote:
> Hi.
>
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.
>

> Commit log:
>
> Please welcome ZFS - The last word in file systems.
>

> ZFS file system was ported from OpenSolaris operating system. The code
> is under the CDDL license.
>
> I'd like to thank all SUN developers that created this great piece of
> software.
>
> Supported by: Wheel LTD (http://www.wheel.pl/)
> Supported by: The FreeBSD Foundation (http://www.freebsdfoundation.org/)
> Supported by: Sentex (http://www.sentex.net/)
>
> Limitations.
>

> Currently ZFS is only compiled as kernel module and is only available
> for i386 architecture. Amd64 should be available very soon, the other
> archs will come later, as we implement needed atomic operations.
>

> Missing functionality.
>
> - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
> iSCSI is also not supported at this point. This should be fixed in
> the future, we may also add support for sharing ZVOLs over ggate.

> - There is no support for ACLs and extended attributes.

> - There is no support for booting off of ZFS file system.
>

> Other than that, ZFS should be fully-functional.
>
> Enjoy!
>


--
Randall Stewart
NSSTG - Cisco Systems Inc.
803-345-0369 <or> 803-317-4952 (cell)

Jorn Argelo

Apr 7, 2007, 6:54:57 AM
Rich Teer wrote:
>> I'm happy to inform that the ZFS file system is now part of the FreeBSD
>> operating system. ZFS is available in the HEAD branch and will be
>> available in FreeBSD 7.0-RELEASE as an experimental feature.
>>
>
> This is fantastic news! At the risk of raking over ye olde arguments,
> as the old saying goes: "Dual licensing? We don't need no stinkeen
> dual licensing!". :-)
>
>
First of all, thanks a lot for all the hard work of both the FreeBSD
developers and the ZFS developers. I can't wait to give it a go.

That leads me to one question though: why is *BSD able to bring it into
the OS, whereas Linux has licensing problems with the CDDL? AFAIK Linux
users can only run it in userland and not in kernel mode because of
the licenses.

I don't really know the differences between all the licenses, so feel
free to correct me if I'm saying something stupid.

Thanks,

Jorn

Florent Thoumie

Apr 7, 2007, 8:15:00 AM
Pawel Jakub Dawidek wrote:
> Hi.
>
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.

Thanks for working on it Pawel!

We're now all waiting for 7.0-RELEASE :-)

--
Florent Thoumie
f...@FreeBSD.org
FreeBSD Committer


Pawel Jakub Dawidek

Apr 7, 2007, 9:13:53 AM
On Sat, Apr 07, 2007 at 04:56:45AM +0200, Bernd Walter wrote:
> On Fri, Apr 06, 2007 at 04:57:00AM +0200, Pawel Jakub Dawidek wrote:
> > Hi.
> >
> > I'm happy to inform that the ZFS file system is now part of the FreeBSD
> > operating system. ZFS is available in the HEAD branch and will be
> > available in FreeBSD 7.0-RELEASE as an experimental feature.
>
> I got a kmem panic just by copying a recent ports.tgz (36M) onto a ZFS
> file system. My sandbox has only 128MB RAM, so kmem was set to ~40M.
> After raising kmem to 80M it survived copying the file, but panicked
> again while running tar -xvzf on the file into the same pool.
> vfs.zfs.vdev.cache.size is unchanged at 10M.

128MB RAM is the suggested minimum in the ZFS requirements, but it may not
be enough... The ARC minimum is set to 1/8 of all memory or 64MB (whichever
is more). Could you locate these lines in the
sys/contrib/opensolaris/uts/common/fs/zfs/arc.c file:

/* set min cache to 1/32 of all memory, or 64MB, whichever is more */
arc_c_min = MAX(arc_c / 4, 64<<20);

Change 64 to e.g. 32, recompile, and retest?
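
In other words, the experiment being suggested is a one-line change to the
quoted line, shown here diff-style:

-	arc_c_min = MAX(arc_c / 4, 64<<20);
+	arc_c_min = MAX(arc_c / 4, 32<<20);

i.e. lower the hard 64MB floor on the ARC minimum to 32MB and see whether the
small-memory box still panics.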

Wilko Bulte

Apr 7, 2007, 10:17:37 AM
On Sat, Apr 07, 2007 at 12:54:57PM +0200, Jorn Argelo wrote..

> Rich Teer wrote:
> >>I'm happy to inform that the ZFS file system is now part of the FreeBSD
> >>operating system. ZFS is available in the HEAD branch and will be
> >>available in FreeBSD 7.0-RELEASE as an experimental feature.
> >>
> >
> >This is fantastic news! At the risk of raking over ye olde arguments,
> >as the old saying goes: "Dual licensing? We don't need no stinkeen
> >dual licensing!". :-)
> >
> >
> First of all, thanks a lot for all the hard work of both the FreeBSD
> developers and the ZFS developers. I can't wait to give it a go.
>
> That leads me to one question though: why is *BSD able to bring it into
> the OS, whereas Linux has licensing problems with the CDDL? AFAIK Linux
> users can only run it in userland and not in kernel mode because of
> the licenses.

My guess(!) is that they do not want non-GPL-ed code in the standard kernel.

--
Wilko Bulte wi...@FreeBSD.org

Florian C. Smeets

Apr 7, 2007, 9:59:02 AM

Pawel Jakub Dawidek wrote:
> On Sat, Apr 07, 2007 at 04:56:45AM +0200, Bernd Walter wrote:
>> My sandbox just has 128MB RAM so kmem was set to ~40M.
>> After raising kmem to 80M it survived copying the file, but paniced
>> again while tar -xvzf the file into the same pool.
>> vfs.zfs.vdev.cache.size is unchanged at 10M.
>
> 128MB RAM of suggested minimum in ZFS requirements, but it may be not
> enough... Minimum of ARC is set to 1/8 of all memory or 64MB (whichever
> is more). Could you locate these lines in
> sys/contrib/opensolaris/uts/common/fs/zfs/arc.c file:
>
> /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
> arc_c_min = MAX(arc_c / 4, 64<<20);
>
> Change 64 to eg. 32, recompile and retest?
>

Hi Pawel,

I had the same problem as Bernd while trying to copy the src tree to
a ZFS volume. I have 384MB RAM but I got the same "kmem_map: too small"
panic. I compiled my kernel as you proposed and now I am able to copy
anything to the volume without a panic :-)

Regards,
Florian

P.S. Thanks for all the work on ZFS!

Bernd Walter

Apr 7, 2007, 2:03:19 PM
On Sat, Apr 07, 2007 at 06:58:00PM +0200, Bernd Walter wrote:

> On Sat, Apr 07, 2007 at 03:59:02PM +0200, Florian C. Smeets wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA1
> >
> > Pawel Jakub Dawidek wrote:
> > > On Sat, Apr 07, 2007 at 04:56:45AM +0200, Bernd Walter wrote:
> > >> My sandbox just has 128MB RAM so kmem was set to ~40M.
> > >> After raising kmem to 80M it survived copying the file, but paniced
> > >> again while tar -xvzf the file into the same pool.
> > >> vfs.zfs.vdev.cache.size is unchanged at 10M.
> > >
> > > 128MB RAM of suggested minimum in ZFS requirements, but it may be not
> > > enough... Minimum of ARC is set to 1/8 of all memory or 64MB (whichever
> > > is more). Could you locate these lines in
> > > sys/contrib/opensolaris/uts/common/fs/zfs/arc.c file:
> > >
> > > /* set min cache to 1/32 of all memory, or 64MB, whichever is more */
> > > arc_c_min = MAX(arc_c / 4, 64<<20);
> > >
> > > Change 64 to eg. 32, recompile and retest?
> > >
> >
> > Hi Pawel,
> >
> > i had the same problems like Bernd while trying to copy the src tree to
> > a ZFS volume. I have 384MB RAM but i got the same "kmem_map: too small"
> > panic. I compiled my kernel like you proposed and now i am able to copy
> > anything to the volume without panic :-)
>
> I had increased RAM to 384 and still had a panic with the default kmem
> (IIRC around 100M), and even increasing kmem to 160M helped for a while,
> but it still produced the panic eventually.
> I don't think 64M applies here as the real limit.

Now with 240M kmem it looks good, but I'm still unsure:
kstat.zfs.misc.arcstats.c_min: 67108864
kstat.zfs.misc.arcstats.c_max: 188743680
kstat.zfs.misc.arcstats.size: 87653376
c_max seemed to increase with kmem, but I only compared it against a
remembered value.
It should be fine with:
vm.kmem_size: 251658240
But top shows wired memory roughly twice the size of arcstats.size, so I'm
still worried about kmem exhaustion if the ARC runs up to c_max.
Since the c_min/c_max values also influence the RAM available for other
purposes, can we at least have them as loader.conf tunables?

Otherwise - the reboots after the panics were impressive.
No long fsck times or noticeable data corruption - even with NFS clients.
All in all it is a great job.

Pawel Jakub Dawidek

Apr 7, 2007, 3:15:17 PM
On Sat, Apr 07, 2007 at 08:03:19PM +0200, Bernd Walter wrote:
> Now with 240M kmem it looks good, but I'm still unsure:
> kstat.zfs.misc.arcstats.c_min: 67108864
> kstat.zfs.misc.arcstats.c_max: 188743680
> kstat.zfs.misc.arcstats.size: 87653376
> c_max seemed to increase with kmem, but I only compared it against a
> remembered value.
> It should be fine with:
> vm.kmem_size: 251658240
> But top shows wired memory roughly twice the size of arcstats.size, so I'm
> still worried about kmem exhaustion if the ARC runs up to c_max.
> Since the c_min/c_max values also influence the RAM available for other
> purposes, can we at least have them as loader.conf tunables?

Just committed a change. You can tune the max and min ARC size via the
vfs.zfs.arc_max and vfs.zfs.arc_min tunables.
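
These are boot-time tunables, so on an affected machine they would go into
/boot/loader.conf, for example (the 80M cap echoes the value Bernd reports
trying later in the thread; the numbers are only illustrative):

vfs.zfs.arc_max="83886080"
vfs.zfs.arc_min="67108865"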

Dag-Erling Smørgrav

Apr 7, 2007, 3:43:59 PM
Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
> Limitations.
>
> Currently ZFS is only compiled as kernel module and is only available
> for i386 architecture. Amd64 should be available very soon, the other
> archs will come later, as we implement needed atomic operations.

ZFS is now also available on pc98 and amd64.

DES
--
Dag-Erling Smørgrav - d...@des.no

Bernd Walter

Apr 7, 2007, 4:34:12 PM
On Sat, Apr 07, 2007 at 09:43:59PM +0200, Dag-Erling Smørgrav wrote:
> Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
> > Limitations.
> >
> > Currently ZFS is only compiled as kernel module and is only available
> > for i386 architecture. Amd64 should be available very soon, the other
> > archs will come later, as we implement needed atomic operations.
>
> ZFS is now also available on pc98 and amd64.

Great to read - is it just atomic.S missing for the remaining
architectures?

Dag-Erling Smørgrav

Apr 7, 2007, 5:16:12 PM
Bernd Walter <ti...@cicely12.cicely.de> writes:
> On Sat, Apr 07, 2007 at 09:43:59PM +0200, Dag-Erling Smørgrav wrote:
> > ZFS is now also available on pc98 and amd64.
> Great to read - is it just atomic.S missing for the remaining
> architectures?

Yes. Ideally, ZFS would use FreeBSD's atomic operations instead of
its own. I believe that the reason it doesn't is (at least in part)
that we don't have 64-bit atomic operations for i386. I have
unfinished patches for cleaning up the atomic operations on all
platforms; I'll dust them off and see what I can do.

DES
--
Dag-Erling Smørgrav - d...@des.no

Bernd Walter

Apr 7, 2007, 5:24:14 PM
On Sat, Apr 07, 2007 at 09:15:17PM +0200, Pawel Jakub Dawidek wrote:
> On Sat, Apr 07, 2007 at 08:03:19PM +0200, Bernd Walter wrote:
> > Now with 240M kmem it looks good, but I'm still unshure:
> > kstat.zfs.misc.arcstats.c_min: 67108864
> > kstat.zfs.misc.arcstats.c_max: 188743680
> > kstat.zfs.misc.arcstats.size: 87653376
> > c_max seemed to be increasing with kmem, but I did compare it with a
> > remebered value.
> > Should be good with:
> > vm.kmem_size: 251658240
> > But top shows wired memory which is roughly twice the size of
> > arcstats.size, so I'm still worried about kmem exhaustion if ARC runs
> > up to c_max.
> > Since the c_min/c_max values also influence the available RAM for other
> > purposes as well, can we have it at least a loader.conf tuneable?
>
> Just committed a change. You can tune the max and min ARC size via the
> vfs.zfs.arc_max and vfs.zfs.arc_min tunables.

Thanks - I've set c_max to 80M now and will see what happens, since
I had such a panic again with 240M kmem.

I'm a bit confused about the calculation as such.
Let's assume a 4G i386 system:
arc_c = 512M
c_min = 512M
c_max = 3G
But isn't this KVA space, of which we usually can't have 3G on i386
without limiting userland to 1G?
Even 512M of KVA sounds like a lot on i386, since 4G systems usually
have better uses for the limited KVA.

Bruno Damour

Apr 8, 2007, 2:03:11 AM
Hello,

After csup, buildworld fails for me in libumem.
Is this due to the zfs import?
Or to my config?

Thanks for any clue, I'm dying to try your brand new zfs on amd64!!

Bruno

FreeBSD vil1.ruomad.net 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Fri Mar 23
07:33:56 CET 2007 ro...@vil1.ruomad.net:/usr/obj/usr/src/sys/VIL1 amd64

make buildworld:

===> cddl/lib/libumem (all)
cc -O2 -fno-strict-aliasing -pipe -march=nocona
-I/usr/src/cddl/lib/libumem/../../../compat/opensolaris/lib/libumem
-D_SOLARIS_C_SOURCE -c /usr/src/cddl/lib/libumem/umem.c
/usr/src/cddl/lib/libumem/umem.c:197: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:30: error: previous definition of
'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:199: error: redefinition of `struct
umem_cache'
/usr/src/cddl/lib/libumem/umem.c:210: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:43: error: previous definition of
'umem_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:233: error: redefinition of 'umem_zalloc'
/usr/src/cddl/lib/libumem/umem.c:66: error: previous definition of
'umem_zalloc' was here
/usr/src/cddl/lib/libumem/umem.c:256: error: redefinition of 'umem_free'
/usr/src/cddl/lib/libumem/umem.c:89: error: previous definition of
'umem_free' was here
/usr/src/cddl/lib/libumem/umem.c:264: error: redefinition of
'umem_nofail_callback'
/usr/src/cddl/lib/libumem/umem.c:97: error: previous definition of
'umem_nofail_callback' was here
/usr/src/cddl/lib/libumem/umem.c:272: error: redefinition of
'umem_cache_create'
/usr/src/cddl/lib/libumem/umem.c:105: error: previous definition of
'umem_cache_create' was here
/usr/src/cddl/lib/libumem/umem.c:291: error: redefinition of
'umem_cache_alloc'
/usr/src/cddl/lib/libumem/umem.c:124: error: previous definition of
'umem_cache_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:321: error: redefinition of
'umem_cache_free'
/usr/src/cddl/lib/libumem/umem.c:154: error: previous definition of
'umem_cache_free' was here
/usr/src/cddl/lib/libumem/umem.c:332: error: redefinition of
'umem_cache_destroy'
/usr/src/cddl/lib/libumem/umem.c:165: error: previous definition of
'umem_cache_destroy' was here
/usr/src/cddl/lib/libumem/umem.c:364: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:197: error: previous definition of
'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:364: error: redefinition of 'nofail_cb'
/usr/src/cddl/lib/libumem/umem.c:197: error: previous definition of
'nofail_cb' was here
/usr/src/cddl/lib/libumem/umem.c:366: error: redefinition of `struct
umem_cache'
/usr/src/cddl/lib/libumem/umem.c:377: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:210: error: previous definition of
'umem_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:377: error: redefinition of 'umem_alloc'
/usr/src/cddl/lib/libumem/umem.c:210: error: previous definition of
'umem_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:400: error: redefinition of 'umem_zalloc'
/usr/src/cddl/lib/libumem/umem.c:233: error: previous definition of
'umem_zalloc' was here
/usr/src/cddl/lib/libumem/umem.c:400: error: redefinition of 'umem_zalloc'
/usr/src/cddl/lib/libumem/umem.c:233: error: previous definition of
'umem_zalloc' was here
/usr/src/cddl/lib/libumem/umem.c:423: error: redefinition of 'umem_free'
/usr/src/cddl/lib/libumem/umem.c:256: error: previous definition of
'umem_free' was here
/usr/src/cddl/lib/libumem/umem.c:423: error: redefinition of 'umem_free'
/usr/src/cddl/lib/libumem/umem.c:256: error: previous definition of
'umem_free' was here
/usr/src/cddl/lib/libumem/umem.c:431: error: redefinition of
'umem_nofail_callback'
/usr/src/cddl/lib/libumem/umem.c:264: error: previous definition of
'umem_nofail_callback' was here
/usr/src/cddl/lib/libumem/umem.c:431: error: redefinition of
'umem_nofail_callback'
/usr/src/cddl/lib/libumem/umem.c:264: error: previous definition of
'umem_nofail_callback' was here
/usr/src/cddl/lib/libumem/umem.c:439: error: redefinition of
'umem_cache_create'
/usr/src/cddl/lib/libumem/umem.c:272: error: previous definition of
'umem_cache_create' was here
/usr/src/cddl/lib/libumem/umem.c:439: error: redefinition of
'umem_cache_create'
/usr/src/cddl/lib/libumem/umem.c:272: error: previous definition of
'umem_cache_create' was here
/usr/src/cddl/lib/libumem/umem.c:458: error: redefinition of
'umem_cache_alloc'
/usr/src/cddl/lib/libumem/umem.c:291: error: previous definition of
'umem_cache_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:458: error: redefinition of
'umem_cache_alloc'
/usr/src/cddl/lib/libumem/umem.c:291: error: previous definition of
'umem_cache_alloc' was here
/usr/src/cddl/lib/libumem/umem.c:488: error: redefinition of
'umem_cache_free'
/usr/src/cddl/lib/libumem/umem.c:321: error: previous definition of
'umem_cache_free' was here
/usr/src/cddl/lib/libumem/umem.c:488: error: redefinition of
'umem_cache_free'
/usr/src/cddl/lib/libumem/umem.c:321: error: previous definition of
'umem_cache_free' was here
/usr/src/cddl/lib/libumem/umem.c:499: error: redefinition of
'umem_cache_destroy'
/usr/src/cddl/lib/libumem/umem.c:332: error: previous definition of
'umem_cache_destroy' was here
/usr/src/cddl/lib/libumem/umem.c:499: error: redefinition of
'umem_cache_destroy'
/usr/src/cddl/lib/libumem/umem.c:332: error: previous definition of
'umem_cache_destroy' was here
*** Error code 1

Stop in /usr/src/cddl/lib/libumem.
*** Error code 1

Stop in /usr/src/cddl/lib.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.
*** Error code 1

Stop in /usr/src.

Pawel Jakub Dawidek

Apr 8, 2007, 5:49:31 AM
On Sun, Apr 08, 2007 at 08:03:11AM +0200, Bruno Damour wrote:
> hello,
>
> After csup, buildworld fails for me in libumem.
> Is this due to zfs import ?
> Or my config ?
>
> Thanks for any clue, i'm dying to try your brand new zfs on amd64 !!
>
> Bruno
>
> FreeBSD vil1.ruomad.net 7.0-CURRENT FreeBSD 7.0-CURRENT #0: Fri Mar 23 07:33:56 CET 2007 ro...@vil1.ruomad.net:/usr/obj/usr/src/sys/VIL1 amd64
>
> make buildworld:
>
> ===> cddl/lib/libumem (all)
> cc -O2 -fno-strict-aliasing -pipe -march=nocona -I/usr/src/cddl/lib/libumem/../../../compat/opensolaris/lib/libumem -D_SOLARIS_C_SOURCE -c /usr/src/cddl/lib/libumem/umem.c
> /usr/src/cddl/lib/libumem/umem.c:197: error: redefinition of 'nofail_cb'
> /usr/src/cddl/lib/libumem/umem.c:30: error: previous definition of 'nofail_cb' was here
> /usr/src/cddl/lib/libumem/umem.c:199: error: redefinition of `struct umem_cache'
> /usr/src/cddl/lib/libumem/umem.c:210: error: redefinition of 'umem_alloc'
> /usr/src/cddl/lib/libumem/umem.c:43: error: previous definition of 'umem_alloc' was here

Did you use my previous patches? There is no cddl/lib/libumem/umem.c in
HEAD; that was its old location and the file was moved to
compat/opensolaris/lib/libumem/. Delete your entire cddl/ directory and
csup again.
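
Concretely, something along these lines (the supfile path is whatever you
normally update with; it is shown here only as a placeholder):

# rm -rf /usr/src/cddl
# csup /path/to/your-supfile
# cd /usr/src && make buildworld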

Max Laier

Apr 8, 2007, 1:10:36 PM
On Saturday 07 April 2007 21:43, Dag-Erling Smørgrav wrote:
> Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
> > Limitations.
> >
> > Currently ZFS is only compiled as kernel module and is only
> > available for i386 architecture. Amd64 should be available very soon,
> > the other archs will come later, as we implement needed atomic
> > operations.
>
> ZFS is now also available on pc98 and amd64.

panic: lock "zfs:&zap->zap_f.zap_num_entries_mtx" 0xffffff006582c260
already initialized

While dump/restoring /usr to zfs. kgdb trace attached. Let me know if
you need further information.

--
/"\ Best regards, | mla...@freebsd.org
\ / Max Laier | ICQ #67774661
X http://pf4freebsd.love2party.net/ | mlaier@EFnet
/ \ ASCII Ribbon Campaign | Against HTML Mail and News

[Attachment: log.dump_panic]

Max Laier

Apr 8, 2007, 2:13:59 PM
On Sunday 08 April 2007 19:10, Max Laier wrote:
> On Saturday 07 April 2007 21:43, Dag-Erling Smørgrav wrote:
> > Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
> > > Limitations.
> > >
> > > Currently ZFS is only compiled as kernel module and is only
> > > available for i386 architecture. Amd64 should be available very
> > > soon, the other archs will come later, as we implement needed
> > > atomic operations.
> >
> > ZFS is now also available on pc98 and amd64.
>
> panic: lock "zfs:&zap->zap_f.zap_num_entries_mtx" 0xffffff006582c260
> already initialized
>
> While dump/restoring /usr to zfs. kgdb trace attached. Let me know
> if you need further information.

The attached diff lets me survive the dump/restore. Not sure if this is
the right fix, but seems like the union messes with mutex initialization.

[Attachment: zfs.dump.diff]

Dag-Erling Smørgrav

Apr 8, 2007, 2:20:44 PM
Max Laier <m...@love2party.net> writes:
> The attached diff lets me survive the dump/restore. Not sure if
> this is the right fix, but seems like the union messes with mutex
> initialization.

You need to track down where memory for the mutex (or rather zap) was
actually allocated, and stick the memset there. I suspect it
originates on the stack somewhere.

Max Laier

Apr 8, 2007, 2:43:13 PM
On Sunday 08 April 2007 20:20, Dag-Erling Smørgrav wrote:
> Max Laier <m...@love2party.net> writes:
> > The attached diff lets me survive the dump/restore. Not sure if
> > this is the right fix, but seems like the union messes with mutex
> > initialization.
>
> You need to track down where memory for the mutex (or rather zap) was
> actually allocated, and stick the memset there. I suspect it
> originates on the stack somewhere.

Well, I assume it is zeroed already, but along the way the other union
members are used, which messes up the storage for the mutex. At least
looking at the contents gives me that impression:

> $2 = {zap_objset = 0xffffff0001406410, zap_object = 12660, zap_dbuf =
> 0xffffff005ce892d0, zap_rwlock = {lock_object = { lo_name =
> 0xffffffff8081b416 "zfs:&zap->zap_rwlock", lo_type = 0xffffffff8081b416
> "zfs:&zap->zap_rwlock", lo_flags = 41615360, lo_witness_data = {
> lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, sx_lock =
> 18446742974215086080, sx_recurse = 0}, zap_ismicro = 0, zap_salt =
> 965910969, zap_u = {zap_fat = {zap_phys = 0xffffffff81670000,
> zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address
> 0x70000 out of bounds>, lo_type = 0x0, lo_flags = 2155822976,
> lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
> sx_lock = 1, sx_recurse = 0}, zap_block_shift = 0}, zap_micro =
> {zap_phys = 0xffffffff81670000, zap_num_entries = 0, zap_num_chunks =
> 7, zap_alloc_next = 0, zap_avl = { avl_root = 0x0, avl_compar =
> 0xffffffff807f3f80 <mze_compare>, avl_offset = 0, avl_numnodes = 1,
> avl_size = 0}}}}

Matthew Dillon

Apr 8, 2007, 2:38:14 PM

:Hi.
:
:I'm happy to inform that the ZFS file system is now part of the FreeBSD

:operating system. ZFS is available in the HEAD branch and will be
:available in FreeBSD 7.0-RELEASE as an experimental feature.

Congratulations on your excellent work, Pawel!

-Matt

Pawel Jakub Dawidek

Apr 8, 2007, 2:53:12 PM
On Sun, Apr 08, 2007 at 07:10:36PM +0200, Max Laier wrote:

> On Saturday 07 April 2007 21:43, Dag-Erling Sm?rgrav wrote:
> > Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
> > > Limitations.
> > >
> > > Currently ZFS is only compiled as kernel module and is only
> > > available for i386 architecture. Amd64 should be available very soon,
> > > the other archs will come later, as we implement needed atomic
> > > operations.
> >
> > ZFS is now also available on pc98 and amd64.
>
> panic: lock "zfs:&zap->zap_f.zap_num_entries_mtx" 0xffffff006582c260
> already initialized
>
> While dump/restoring /usr to zfs. kgdb trace attached. Let me know if
> you need further information.
[...]
> #10 0xffffffff80295755 in panic (fmt=0xffffffff80481bc0 "lock \"%s\" %p already initialized") at /usr/src/sys/kern/kern_shutdown.c:547
> #11 0xffffffff802bd72e in lock_init (lock=0x0, class=0xffffffff80a11000, name=0xa <Address 0xa out of bounds>,
> type=0x1b1196 <Address 0x1b1196 out of bounds>, flags=1048064) at /usr/src/sys/kern/subr_lock.c:201
> #12 0xffffffff807f092a in fzap_upgrade (zap=0xffffff006582c200, tx=0xffffff006591dd00)
> at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap.c:87
> #13 0xffffffff807f42d3 in mzap_upgrade (zap=0xffffff006582c200, tx=0xffffff006591dd00)
> at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:361
> #14 0xffffffff807f4cd4 in zap_add (os=0x0, zapobj=18446744071572623360, name=0xffffff00060ebc19 "org.eclipse.jdt_3.2.1.r321_v20060905-R4CM1Znkvre9wC-",
> integer_size=8, num_integers=1, val=0xffffffffaeeb6860, tx=0xffffff006591dd00)
> at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:622
> #15 0xffffffff80802d06 in zfs_link_create (dl=0xffffff0065554140, zp=0xffffff005ccfac08, tx=0xffffff006591dd00, flag=1)
> at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_dir.c:564
> #16 0xffffffff8080c01c in zfs_mkdir (ap=0xffffffffaeeb6960) at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c:1474
> #17 0xffffffff804490f9 in VOP_MKDIR_APV (vop=0x12, a=0xffffffffaeeb6960) at vnode_if.c:1234
> #18 0xffffffff80316195 in kern_mkdir (td=0xffffff000105e000, path=0x5149d1 <Address 0x5149d1 out of bounds>, segflg=15549312, mode=511) at vnode_if.h:653
> #19 0xffffffff8041abd0 in syscall (frame=0xffffffffaeeb6c70) at /usr/src/sys/amd64/amd64/trap.c:825
> #20 0xffffffff8040206b in Xfast_syscall () at /usr/src/sys/amd64/amd64/exception.S:272
> #21 0x000000080071969c in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb) f 12
> #12 0xffffffff807f092a in fzap_upgrade (zap=0xffffff006582c200, tx=0xffffff006591dd00)
> at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zap.c:87
> 87 mutex_init(&zap->zap_f.zap_num_entries_mtx, NULL, MUTEX_DEFAULT, 0);
> (kgdb) p zap
> $1 = (zap_t *) 0xffffff006582c200
> (kgdb) p *zap

> $2 = {zap_objset = 0xffffff0001406410, zap_object = 12660, zap_dbuf = 0xffffff005ce892d0, zap_rwlock = {lock_object = {
> lo_name = 0xffffffff8081b416 "zfs:&zap->zap_rwlock", lo_type = 0xffffffff8081b416 "zfs:&zap->zap_rwlock", lo_flags = 41615360, lo_witness_data = {
> lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, sx_lock = 18446742974215086080, sx_recurse = 0}, zap_ismicro = 0, zap_salt = 965910969,
> zap_u = {zap_fat = {zap_phys = 0xffffffff81670000, zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address 0x70000 out of bounds>,
> lo_type = 0x0, lo_flags = 2155822976, lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}}, sx_lock = 1, sx_recurse = 0},
> zap_block_shift = 0}, zap_micro = {zap_phys = 0xffffffff81670000, zap_num_entries = 0, zap_num_chunks = 7, zap_alloc_next = 0, zap_avl = {
> avl_root = 0x0, avl_compar = 0xffffffff807f3f80 <mze_compare>, avl_offset = 0, avl_numnodes = 1, avl_size = 0}}}}

fzap_upgrade() changes the type from 'zap_micro' to 'zap_fat', and a union is
used for this (see
sys/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h); that's why we
see this trash:

zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address 0x70000 out of bounds>,
lo_type = 0x0, lo_flags = 2155822976, lo_witness_data = {lod_list = {stqe_next = 0x0},
lod_witness = 0x0}}, sx_lock = 1, sx_recurse = 0},

I already use kmem_zalloc() (note the _z_) for zap allocation in
zap_micro.c, so Max is right that we have to clear this structure here.

I'm quite tired of tracking down such problems, because our mechanism for
detecting already-initialized locks is too simple (based on one bit), so
I'd prefer to improve it, or just add a bzero() to mutex_init().
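
A minimal sketch of the kind of local workaround being discussed, i.e.
clearing the storage that the union reuse left dirty before initializing the
lock (whether fzap_upgrade() is the right place for it is exactly the open
question here):

/* The union previously held mzap state, so the bytes backing the new
   mutex may contain stale lock bits; clear them before mutex_init(). */
bzero(&zap->zap_f.zap_num_entries_mtx,
    sizeof(zap->zap_f.zap_num_entries_mtx));
mutex_init(&zap->zap_f.zap_num_entries_mtx, NULL, MUTEX_DEFAULT, 0);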

Pawel Jakub Dawidek

Apr 8, 2007, 9:07:03 PM
On Sun, Apr 08, 2007 at 08:53:12PM +0200, Pawel Jakub Dawidek wrote:
> fzap_upgrade() changes type from 'zap_micro' to 'zap_fat' and union is
> used for this (see
> sys/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h), that's why we
> see this trash:
>
> zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address 0x70000 out of bounds>,
> lo_type = 0x0, lo_flags = 2155822976, lo_witness_data = {lod_list = {stqe_next = 0x0},
> lod_witness = 0x0}}, sx_lock = 1, sx_recurse = 0},
>
> I already use kmem_zalloc() (note _z_) for zap allocation in
> zap_micro.c, so Max is right, that we have to clear this structure here.
>
> I'm quite tired of tracking such problems, because our mechanism for
> detecting already initialized locks is too simple (based on one bit), so
> I'd prefer to improve it, or just add bzero() to mutex_init().

I just committed a fix. Now I do a 13-bit check to detect already-initialized
locks instead of the standard 1-bit check. Could you repeat your test?

Max Laier

Apr 8, 2007, 9:59:24 PM
On Monday 09 April 2007 03:07, Pawel Jakub Dawidek wrote:
> On Sun, Apr 08, 2007 at 08:53:12PM +0200, Pawel Jakub Dawidek wrote:
> > fzap_upgrade() changes type from 'zap_micro' to 'zap_fat' and union
> > is used for this (see
> > sys/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h), that's why
> > we see this trash:
> >
> > zap_num_entries_mtx = {lock_object = {lo_name = 0x70000 <Address
> > 0x70000 out of bounds>, lo_type = 0x0, lo_flags = 2155822976,
> > lo_witness_data = {lod_list = {stqe_next = 0x0}, lod_witness = 0x0}},
> > sx_lock = 1, sx_recurse = 0},
> >
> > I already use kmem_zalloc() (note _z_) for zap allocation in
> > zap_micro.c, so Max is right, that we have to clear this structure
> > here.
> >
> > I'm quite tired of tracking such problems, because our mechanism for
> > detecting already initialized locks is too simple (based on one bit),
> > so I'd prefer to improve it, or just add bzero() to mutex_init().
>
> I just committed a fix. Now I do 13 bits check for already initialized
> locks detection instead of standard 1 bit check. Could you repeat your
> test?

Will do tomorrow. Thanks.

banshee

Apr 9, 2007, 8:54:45 AM

Hello, that is great news! But is it possible to use zfs + gbde?

I have the following configuration:

rc.conf:
gbde_autoattach_all="yes"
gbde_devices="ad0s1g"

fstab:
/dev/ad0s1g.bde /home ufs rw 2 2

So I just have to type a passphrase at boot time, and in this case zfs must be on ad0s1g.bde?

--

Contra vim mortis, non est medicaments...
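
Since the FreeBSD port treats any GEOM provider as a potential vdev, the
setup should look roughly like this once the .bde device is attached at boot
(the pool name is made up, and gbde + ZFS has not been tested in this thread):

# zpool create home ad0s1g.bde
# zfs set mountpoint=/home home

The /dev/ad0s1g.bde line would then come out of fstab, since ZFS mounts its
own datasets.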

Max Laier

Apr 9, 2007, 10:13:39 AM
On Monday 09 April 2007 03:59, Max Laier wrote:
> On Monday 09 April 2007 03:07, Pawel Jakub Dawidek wrote:
> > On Sun, Apr 08, 2007 at 08:53:12PM +0200, Pawel Jakub Dawidek wrote:
...

> > > I'm quite tired of tracking such problems, because our mechanism
> > > for detecting already initialized locks is too simple (based on one
> > > bit), so I'd prefer to improve it, or just add bzero() to
> > > mutex_init().
> >
> > I just committed a fix. Now I do 13 bits check for already
> > initialized locks detection instead of standard 1 bit check. Could
> > you repeat your test?
>
> Will do tomorrow. Thanks.

Confirmed to work for my testcase.

Jeremie Le Hen

Apr 9, 2007, 1:57:40 PM
Hi,

On Fri, Apr 06, 2007 at 04:57:00AM +0200, Pawel Jakub Dawidek wrote:
> I'm happy to inform that the ZFS file system is now part of the FreeBSD
> operating system. ZFS is available in the HEAD branch and will be
> available in FreeBSD 7.0-RELEASE as an experimental feature.

Thank you very much for the work, Pawel. This is great news.

BTW, does anyone have preliminary performance tests? I can't do them
as I have no spare disk currently.

Thank you.
Best regards,
--
Jeremie Le Hen
< jeremie at le-hen dot org >< ttz at chchile dot org >

Craig Boston

Apr 9, 2007, 8:35:05 PM
On Sat, Apr 07, 2007 at 11:24:14PM +0200, Bernd Walter wrote:
> On Sat, Apr 07, 2007 at 09:15:17PM +0200, Pawel Jakub Dawidek wrote:
> > Just committed a change. You can tune max and min ARC size via
> > vfs.zfs.arc_max and vfs.zfs.arc_min tunnables.
>
> Thanks - I'd set c_max to 80M now and will see what happens, since
> I had such a panic again with 240M kmem.

Hi, just wanted to chime in that I'm experiencing the same panic with
a fresh -CURRENT.

I'm seriously considering trying out ZFS on my home file server (this
should tell you how much I've come to trust pjd's work ;). Anyway,
since it's a repurposed desktop with a crappy board, it's limited to
512MB RAM. So I've been testing in a VMware instance with 512MB. My
vm.kmem_size is defaulting to 169758720.

Works fine up until the point I start copying lots of files onto the ZFS
partition. I tried the suggestion of reducing the tunables. After
modifying the source to accept these values, I have it set to:

kstat.zfs.misc.arcstats.p: 33554432
kstat.zfs.misc.arcstats.c: 67108864
kstat.zfs.misc.arcstats.c_min: 33554432
kstat.zfs.misc.arcstats.c_max: 67108864
kstat.zfs.misc.arcstats.size: 20606976

This is after a clean boot before trying anything. arcstats.size floats
right at the max for quite a while before the panic happens, so I
suspect something else is causing it to run out of kvm, perhaps the
normal buffer cache since I'm copying from a UFS filesystem.

panic: kmem_malloc(131072): kmem_map too small: 131440640 total
allocated

Though the backtrace (assuming I'm loading the module symbols correctly)
seems to implicate zfs.

#0 doadump () at pcpu.h:172
#1 0xc06bbaab in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2 0xc06bbd38 in panic (
fmt=0xc094f28c "kmem_malloc(%ld): kmem_map too small: %ld total allocated")
at /usr/src/sys/kern/kern_shutdown.c:563
#3 0xc0821e70 in kmem_malloc (map=0xc145408c, size=131072, flags=2)
at /usr/src/sys/vm/vm_kern.c:305
#4 0xc0819d56 in page_alloc (zone=0x0, bytes=131072, pflag=0x0, wait=2)
at /usr/src/sys/vm/uma_core.c:955
#5 0xc081bfcf in uma_large_malloc (size=131072, wait=2)
at /usr/src/sys/vm/uma_core.c:2709
#6 0xc06b0eb1 in malloc (size=131072, mtp=0xc0bd0080, flags=2)
at /usr/src/sys/kern/kern_malloc.c:364
#7 0xc0b66f67 in zfs_kmem_alloc (size=131072, kmflags=2)
at /usr/src/sys/modules/zfs/../../compat/opensolaris/kern/opensolaris_kmem.c:67
#8 0xc0bb23ad in zio_buf_alloc (size=131072)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zio.c:211
#9 0xc0ba4487 in vdev_queue_io_to_issue (vq=0xc3424ee4, pending_limit=Unhandled dwarf expression opcode 0x93
)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:213
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/vdev_queue.c:312
#11 0xc0bc69fd in vdev_geom_io_done (zio=0xc4435400)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:412
#12 0xc0b6ad19 in taskq_thread (arg=0xc2dfa0cc)
at /usr/src/sys/modules/zfs/../../contrib/opensolaris/uts/common/os/taskq.c:833
#13 0xc06a54ba in fork_exit (callout=0xc0b6ac18 <taskq_thread>,
arg=0xc2dfa0cc, frame=0xd62cdd38) at /usr/src/sys/kern/kern_fork.c:814
#14 0xc08a8c10 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:205

I haven't tried increasing kmem yet -- I'm a bit leery of devoting so
much memory (presumably nonpageable, nonreclaimable) to the kernel.

Admittedly I'm somewhat confused as to why ZFS needs its own special
cache rather than sharing the system's, or at least only using free
physical pages allocated as VM objects rather than precious kmem. But
I'm no VM guru :)

Craig

Craig Boston

Apr 9, 2007, 8:38:37 PM
On Mon, Apr 09, 2007 at 07:35:05PM -0500, Craig Boston wrote:
> 512MB RAM. So I've been testing in a VMware instance with 512MB. My
> vm.kmem_size is defaulting to 169758720.

Meh, wrong stat, I probably should have said that

vm.kmem_size_max: 335544320

Kris Kennaway

Apr 9, 2007, 9:11:25 PM
On Mon, Apr 09, 2007 at 07:38:37PM -0500, Craig Boston wrote:
> On Mon, Apr 09, 2007 at 07:35:05PM -0500, Craig Boston wrote:
> > 512MB RAM. So I've been testing in a VMware instance with 512MB. My
> > vm.kmem_size is defaulting to 169758720.
>
> Meh, wrong stat, I probably should have said that
>
> vm.kmem_size_max: 335544320

Nah, you were right the first time :) Your system is defaulting to
160MB for the kmem_map, of which zfs will (by default) try to use up
to 3/4. Naturally this doesn't leave much for the rest of the kernel
(40MB), so you'll easily run the kernel out of memory.

For now, you probably want to increase vm.kmem_size a bit to allow
some more room for zfs, and set vfs.zfs.arc_max and arc_min to
something more reasonable like 64*1024*1024+1 (the +1 is needed
because zfs currently requires "greater than 64MB" for the arc).

Kris
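
Put into /boot/loader.conf, Kris's suggestion comes out roughly as follows
(256MB for kmem is only an example; Craig reports using the same value later
in the thread):

vm.kmem_size="268435456"
vfs.zfs.arc_max="67108865"
vfs.zfs.arc_min="67108865"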

Craig Boston

Apr 9, 2007, 9:30:35 PM
On Mon, Apr 09, 2007 at 09:11:25PM -0400, Kris Kennaway wrote:
> Nah, you were right the first time :) Your system is defaulting to
> 160MB for the kmem_map, of which zfs will (by default) try to use up
> to 3/4. Naturally this doesn't leave much for the rest of the kernel
> (40MB), so you'll easily run the kernel out of memory.

Hmm, I had already reduced the maximum arc size to 64MB though, which I
figured (hoped?) would leave plenty of room.

So if kmem_size is the total size and it can't grow, what is
kmem_size_max? Is there a way to see a sum of total kmem allocation?
Even the vm.zone breakdown seems to be gone in current so apparently my
knowledge of such things is becoming obsolete :)

> For now, you probably want to increase vm.kmem_size a bit to allow
> some more room for zfs, and set vfs.zfs.arc_max and arc_min to
> something more reasonable like 64*1024*1024+1 (the +1 is needed
> because zfs currently requires "greater than 64MB" for the arc).

Yeah, I found that out the hard way after wondering why it was ignoring
the tunables :)

I ran out of kmem_map space once with it set to 64*1024*1024+1, then I
modified the source so that it would accept zfs_arc_max >= (64 << 20)
instead, just in case it was a power-of-2 thing.

Craig

Craig Boston

Apr 9, 2007, 9:42:33 PM
On Mon, Apr 09, 2007 at 08:30:35PM -0500, Craig Boston wrote:
> Even the vm.zone breakdown seems to be gone in current so apparently my
> knowledge of such things is becoming obsolete :)

But vmstat -m still works

...

solaris 145806 122884K - 15319671 16,32,64,128,256,512,1024,2048,4096
...

Whoa! That's a lot of kernel memory. Meanwhile...

kstat.zfs.misc.arcstats.size: 33554944
(which is just barely above vfs.zfs.arc_min)

So I don't think it's the arc cache (yeah I know that's redundant) that
is the problem. Seems like something elsewhere in zfs is allocating
large amounts of memory and not letting it go, and even the cache is
having to shrink to its minimum size due to the memory pressure.

It didn't panic this time, so when the tar finished I tried a "zfs
unmount /usr/ports". This caused the "solaris" entry to drop down to
about 64MB, so it's not a leak. It could just be that ZFS needs lots of
memory to operate if it keeps a lot of metadata for each file in memory.

The sheer # of allocations still seems excessive though. It was well
over 20 million by the time the tar process exited.

Kris Kennaway

Apr 9, 2007, 9:48:23 PM
On Mon, Apr 09, 2007 at 08:30:35PM -0500, Craig Boston wrote:
> On Mon, Apr 09, 2007 at 09:11:25PM -0400, Kris Kennaway wrote:
> > Nah, you were right the first time :) Your system is defaulting to
> > 160MB for the kmem_map, of which zfs will (by default) try to use up
> > to 3/4. Naturally this doesn't leave much for the rest of the kernel
> > (40MB), so you'll easily run the kernel out of memory.
>
> Hmm, I had already reduced the maximum arc size to 64MB though, which I
> figured (hoped?) would leave plenty of room.
>
> So if kmem_size is the total size and it can't grow, what is
> kmem_size_max? Is there a way to see a sum of total kmem allocation?
> Even the vm.zone breakdown seems to be gone in current so apparently my
> knowledge of such things is becoming obsolete :)

It's the cap used by the auto-sizing code, i.e. no matter how much RAM
the system has it will never use more than 320MB for kmem, by default.

Currently I think there is no exported way to view the amount of free
space in the map, but there should be.

> > For now, you probably want to increase vm.kmem_size a bit to allow
> > some more room for zfs, and set vfs.zfs.arc_max and arc_min to
> > something more reasonable like 64*1024*1024+1 (the +1 is needed
> > because zfs currently requires "greater than 64MB" for the arc).
>
> Yeah, I found that out the hard way after wondering why it was ignoring
> the tunables :)
>
> I ran out of kmem_map space once with it set to 64*1024*1024+1, then I
> modified the source so that it would accept zfs_arc_max >= (64 << 20)
> instead, just in case it was a power-of-2 thing.

OK. Probably this is a sign that 160 - 64 = 96MB is not enough for
your kernel, i.e. you'd also get the panics if you turned down
vm.kmem_size to 96MB and didn't use zfs.

Kris

Kris Kennaway

Apr 9, 2007, 9:55:23 PM
On Mon, Apr 09, 2007 at 08:42:33PM -0500, Craig Boston wrote:
> On Mon, Apr 09, 2007 at 08:30:35PM -0500, Craig Boston wrote:
> > Even the vm.zone breakdown seems to be gone in current so apparently my
> > knowledge of such things is becoming obsolete :)
>
> But vmstat -m still works
>
> ...
>
> solaris 145806 122884K - 15319671 16,32,64,128,256,512,1024,2048,4096
> ...
>
> Whoa! That's a lot of kernel memory. Meanwhile...
>
> kstat.zfs.misc.arcstats.size: 33554944
> (which is just barely above vfs.zfs.arc_min)
>
> So I don't think it's the arc cache (yeah I know that's redundant) that
> is the problem. Seems like something elsewhere in zfs is allocating
> large amounts of memory and not letting it go, and even the cache is
> having to shrink to its minimum size due to the memory pressure.
>
> It didn't panic this time, so when the tar finished I tried a "zfs
> unmount /usr/ports". This caused the "solaris" entry to drop down to
> about 64MB, so it's not a leak. It could just be that ZFS needs lots of
> memory to operate if it keeps a lot of metadata for each file in memory.
>
> The sheer # of allocations still seems excessive though. It was well
> over 20 million by the time the tar process exited.

That is a lifetime count of the # of operations, not the current
number allocated ("InUse").

It does look like there is something else using a significant amount
of memory apart from arc, but arc might at least be the major one due
to its extremely greedy default allocation policy.

Kris

Craig Boston

unread,
Apr 9, 2007, 10:04:55 PM4/9/07
to
On Mon, Apr 09, 2007 at 09:55:23PM -0400, Kris Kennaway wrote:
> That is a lifetime count of the # of operations, not the current
> number allocated ("InUse").

Yes, perhaps I should have said "sheer number of allocations &
deallocations". I was just surprised that it seems to grab and release
memory much more often than anything else tracked by vmstat.

> It does look like there is something else using a significant amount
> of memory apart from arc, but arc might at least be the major one due
> to its extremely greedy default allocation policy.

I wasn't going to post again until somebody suggested trying this, but I
think the name cache can be ruled out. I reduced vfs.zfs.dnlc.ncsize
from ~13000 to 4096 with no appreciable drop in total memory usage.

It seems to be stable with vm.kmem_size at 256MB, but the wired count
has come dangerously close a few times.
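
The figures being compared in this subthread can be watched with a couple of
one-liners (sysctl and malloc type names as used above):

    # Current ARC size and its configured bounds.
    sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_min vfs.zfs.arc_max
    # Kernel malloc usage attributed to the Solaris compatibility layer.
    vmstat -m | grep -i solaris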

Pawel Jakub Dawidek

unread,
Apr 9, 2007, 10:38:57 PM4/9/07
to
On Mon, Apr 09, 2007 at 08:42:33PM -0500, Craig Boston wrote:
> On Mon, Apr 09, 2007 at 08:30:35PM -0500, Craig Boston wrote:
> > Even the vm.zone breakdown seems to be gone in current so apparently my
> > knowledge of such things is becoming obsolete :)
>
> But vmstat -m still works
>
> ...
>
> solaris 145806 122884K - 15319671 16,32,64,128,256,512,1024,2048,4096
> ...
>
> Whoa! That's a lot of kernel memory. Meanwhile...
>
> kstat.zfs.misc.arcstats.size: 33554944
> (which is just barely above vfs.zfs.arc_min)
>
> So I don't think it's the arc cache (yeah I know that's redundant) that
> is the problem. Seems like something elsewhere in zfs is allocating
> large amounts of memory and not letting it go, and even the cache is
> having to shrink to its minimum size due to the memory pressure.

ARC and ZIO are the biggest memory consumers and they are somehow
connected. I just committed changes that should stabilize ZFS in this
regard. Could you try them?

Craig Boston

unread,
Apr 10, 2007, 12:04:31 AM4/10/07
to
On Tue, Apr 10, 2007 at 04:38:57AM +0200, Pawel Jakub Dawidek wrote:
> ARC and ZIO are the biggest memory consumers and they are somehow
> connected. I just committed changes that should stabilize ZFS in this
> regard. Could you try them?

Hrm, well I was attempting to but it panic'd in the middle of the kernel
build (/usr/src and obj are on the test zfs partition). Apparently
256MB isn't enough kmem either. I'll bump it up again and try
rebuilding, and lower it back to 256 for testing.

kmem_malloc(131072): kmem_map too small: 214921216 total allocated

Craig Boston

unread,
Apr 10, 2007, 12:36:10 AM4/10/07
to
On Tue, Apr 10, 2007 at 04:38:57AM +0200, Pawel Jakub Dawidek wrote:
> ARC and ZIO are the biggest memory consumers and they are somehow
> connected. I just committed changes that should stabilize ZFS in this
> regard. Could you try them?

Preliminary results with the latest -current kernel and
vm.kmem_size=268435456, disabling all my other loader.conf entries and
letting it autosize:

kstat.zfs.misc.arcstats.p: 15800320
kstat.zfs.misc.arcstats.c: 16777216
kstat.zfs.misc.arcstats.c_min: 16777216
kstat.zfs.misc.arcstats.c_max: 134217728
kstat.zfs.misc.arcstats.size: 18003456

solaris 43705 91788K - 4522887 16,32,64,128,256,512,1024,2048,4096

So it looks like it autosized the ARC to a 16M-128M range. I'm
currently doing a buildworld and am going to try untarring the ports
tree. The ARC size is tending to hover around 16-20M, probably due to
memory pressure. The "solaris" group appears to be taking up about 16M
less memory than it did before, which is consistent with the ARC being
16M smaller (I had changed the minimum to 32M before reverting to HEAD).

I may poke around in ZIO but judging from the complexity of the code I
don't have much of a chance of really understanding it anytime soon.

In my defense, the machine I was planning to use this on isn't _that_
old. It's a 2Ghz P4, which should be "okay" as far as checksum
calculations go. It just has a braindead motherboard that refuses to
accept more than 512MB of RAM.

Andrey V. Elsukov

unread,
Apr 10, 2007, 1:17:16 AM4/10/07
to
Pawel Jakub Dawidek wrote:
> Limitations.
>
> Currently ZFS is only compiled as kernel module and is only available
> for i386 architecture. Amd64 should be available very soon, the other
> archs will come later, as we implement needed atomic operations.
>
> Missing functionality.
>
> - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
> iSCSI is also not supported at this point. This should be fixed in
> the future, we may also add support for sharing ZVOLs over ggate.
> - There is no support for ACLs and extended attributes.
> - There is no support for booting off of ZFS file system.
>
> Other than that, ZFS should be fully-functional.

Hi, Pawel. Thanks for the great work!

1. I have yesterday's CURRENT and I get a `kmem_map too small`
panic when I try to copy /usr/src to a ZFS partition with compression
enabled. (I have 512M of RAM)

2. I've tried snapshots. Everything seems to work well. I have one
question: should the .zfs directory be invisible? I can `cd .zfs`
and see its contents, but maybe .zfs should be visible like
UFS's .snap?

--
WBR, Andrey V. Elsukov

Kris Kennaway

unread,
Apr 10, 2007, 2:10:07 AM4/10/07
to
On Tue, Apr 10, 2007 at 09:17:16AM +0400, Andrey V. Elsukov wrote:
> Pawel Jakub Dawidek wrote:
> >Limitations.
> >
> > Currently ZFS is only compiled as kernel module and is only available
> > for i386 architecture. Amd64 should be available very soon, the other
> > archs will come later, as we implement needed atomic operations.
> >
> >Missing functionality.
> >
> > - We don't have iSCSI target daemon in the tree, so sharing ZVOLs via
> > iSCSI is also not supported at this point. This should be fixed in
> > the future, we may also add support for sharing ZVOLs over ggate.
> > - There is no support for ACLs and extended attributes.
> > - There is no support for booting off of ZFS file system.
> >
> >Other than that, ZFS should be fully-functional.
>
> Hi, Pawel. Thanks for the great work!
>
> 1. I have yesterday's CURRENT and I get a `kmem_map too small`
> panic when I try to copy /usr/src to a ZFS partition with compression
> enabled. (I have 512M of RAM)

See discussion in many other emails (e.g. mine). Also cvs update.

> 2. I've tried snapshots. Everything seems to work well. I have one
> question: should the .zfs directory be invisible? I can `cd .zfs`
> and see its contents, but maybe .zfs should be visible like
> UFS's .snap?

I think this is controlled by the 'snapdir' property, see p80 of the
admin guide.

Kris

Rong-en Fan

unread,
Apr 10, 2007, 2:35:25 AM4/10/07
to

Isn't that 'hidden' by default?

Regards,
Rong-En Fan

>
> Kris

Kris Kennaway

unread,
Apr 10, 2007, 2:38:17 AM4/10/07
to
On Tue, Apr 10, 2007 at 02:35:25PM +0800, Rong-en Fan wrote:

> >> 2. I've tried snapshots. Everything seems to work well. I have one
> >> question: should the .zfs directory be invisible? I can `cd .zfs`
> >> and see its contents, but maybe .zfs should be visible like
> >> UFS's .snap?
> >
> >I think this is controlled by the 'snapdir' property, see p80 of the
> >admin guide.
>
> Isn't that 'hidden' by default?

I thought that is what the claim was, and the question was how to make
it visible :)

Kris

Scot Hetzel

unread,
Apr 10, 2007, 2:28:25 AM4/10/07
to
On 4/10/07, Andrey V. Elsukov <bu7...@yandex.ru> wrote:
> 2. I've tried snapshots. Everything seems to work well. I have one
> question: should the .zfs directory be invisible? I can `cd .zfs`
> and see its contents, but maybe .zfs should be visible like
> UFS's .snap?
>

From the zfs(1M) man page:
:
Snapshots
:

File system snapshots can be accessed under the ".zfs/snapshot" direc-
tory in the root of the file system. Snapshots are automatically
mounted on demand and may be unmounted at regular intervals. The visi-
bility of the ".zfs" directory can be controlled by the "snapdir" prop-
erty.
:
snapdir=hidden | visible

Controls whether the ".zfs" directory is hidden or visible in the
root of the file system as discussed in the "Snapshots" section.
The default value is "hidden".

Scot
--
DISCLAIMER:
No electrons were maimed while sending this message. Only slightly bruised.
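
A short illustration of the behaviour described in the man page excerpt above
(pool and dataset names are made up):

    # Snapshots are reachable under .zfs/snapshot even while snapdir=hidden.
    zfs snapshot tank/ports@clean
    ls /tank/ports/.zfs/snapshot/clean
    # Make the .zfs directory show up in ordinary directory listings.
    zfs set snapdir=visible tank/ports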

Andrey V. Elsukov

unread,
Apr 10, 2007, 3:03:49 AM4/10/07
to
Kris Kennaway wrote:
> I thought that is what the claim was, and the question was how to make
> it visible :)

Yes, thanks for the answer. Now I've been locked up in the "zfs"
state :)

How to repeat:
# zfs set snapdir=visible media/disk3/src
# ls -la media/disk3/src/.zfs

zfs-report.txt

Kris Kennaway

unread,
Apr 10, 2007, 3:06:28 AM4/10/07
to

\o/

You might need to recompile with DEBUG_LOCKS and DEBUG_VFS_LOCKS and
do 'show lockedvnods', but maybe this is trivially reproducible.

Kris

>
> --
> WBR, Andrey V. Elsukov

> UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
> 0 1455 1127 0 -4 0 18032 3896 zfs D+ v0 0:00,01 mc
> 0 1462 1457 0 -4 0 3416 1236 zfs T+ p0 0:00,00 ls -lA
>
> db> trace 1462
> Tracing pid 1462 tid 100061 td 0xc29c81b0
> sched_switch(c29c81b0,0,1) at sched_switch+0xc7
> mi_switch(1,0) at mi_switch+0x1d4
> sleepq_switch(c3404d18) at sleepq_switch+0x8a
> sleepq_wait(c3404d18,50,0,0,c07315e4,...) at sleepq_wait+0x36
> _sleep(c3404d18,c076f340,50,c2946f6f,0,...) at _sleep+0x24d
> acquire(d3b2e728,80,60000,d3b2e708,d3b2e70c,...) at acquire+0x73
> _lockmgr(c3404d18,3002,c3404d48,c29c81b0,c2941e7b,...) at _lockmgr+0x442
> vop_stdlock(d3b2e770) at vop_stdlock+0x27
> _VOP_LOCK_APV(c294abc0,d3b2e770) at _VOP_LOCK_APV+0x38
> _vn_lock(c3404cc0,1002,c29c81b0,c2941e7b,c4,...) at _vn_lock+0xf8
> domount(c29c81b0,c3404cc0,c2946f6f,c2ce87c0,d3b2e85c,...) at domount+0xfd
> zfsctl_snapdir_lookup(d3b2eacc) at zfsctl_snapdir_lookup+0x1ac
> VOP_LOOKUP_APV(c294adc0,d3b2eacc) at VOP_LOOKUP_APV+0x43
> lookup(d3b2eb50) at lookup+0x4c0
> namei(d3b2eb50) at namei+0x2d2
> kern_lstat(c29c81b0,2821c268,0,d3b2ec24) at kern_lstat+0x47
> lstat(c29c81b0,d3b2ed00) at lstat+0x1b
> syscall(d3b2ed38) at syscall+0x29e
> Xint0x80_syscall() at Xint0x80_syscall+0x20
> --- syscall (190, FreeBSD ELF32, lstat), eip = 0x2818d267, esp = 0xbfbfe3fc, ebp
> = 0xbfbfe498 ---
> db> cont
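
A sketch of the kernel config additions Kris refers to, assuming a custom
kernel with the debugger compiled in:

    options DDB               # in-kernel debugger, provides "show lockedvnods"
    options KDB
    options DEBUG_LOCKS       # record where locks were acquired
    options DEBUG_VFS_LOCKS   # extra vnode locking assertions
    options WITNESS           # lock order checking (optional, slower)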

Andrey V. Elsukov

unread,
Apr 10, 2007, 4:12:21 AM4/10/07
to
Kris Kennaway wrote:
> \o/
>
> You might need to recompile with DEBUG_LOCKS and DEBUG_VFS_LOCKS and
> do 'show lockedvnods', but maybe this is trivially reproducible.

I've rollbacked and destroyed this snapshot and now don't have this
problem. But i have several LOR.

zfs-report.txt

Ivan Voras

unread,
Apr 10, 2007, 5:11:18 AM4/10/07
to
Wilko Bulte wrote:
> On Sat, Apr 07, 2007 at 12:54:57PM +0200, Jorn Argelo wrote..
>> Rich Teer wrote:
>>> This is fantastic news! At the risk of raking over ye olde arguments,
>>> as the old saying goes: "Dual licensing? We don't need no stinkeen
>>> dual licensing!". :-)
>>>
>>>
>> First of all, thanks a lot for all the hard work of both the FreeBSD
>> developers as the ZFS developers. I can't wait to give it a go.
>>
>> That leads me to one question though: Why is *BSD able to bring it into
>> the OS whereas Linux has licensing problems with the CDDL? AFAIK Linux
>> users can only run it in userland mode and not in kernel mode because of
>> the licenses.
>
> My guess(!) is that they do not want non-GPL-ed code in the standard kernel.

Sorry if I'm reiterating what someone maybe already explained, but I
don't see it on the lists I read:

FreeBSD can include GPL'ed code due to a "technicality" (literally): As
long as the code is in a separate kernel module and not in the default
shipped GENERIC kernel, it's considered "bundled" and not a part of the
kernel. As soon as the user loads a GPLed kernel module, presto-changeo!
his kernel "automagically" becomes GPLed. I believe the same holds for
CDDL. (I have no idea how to resolve the licensing issues of a kernel
with both GPL and CDDL parts :) ). This is less inconvenient than it
seems since kernel modules can be (pre)loaded at the same time the
kernel loads, and so we can have a ZFS root partition, etc.

The problem with DTrace in FreeBSD is twofold:

1. It's much more intertwined with the kernel.
2. Much of its usability comes from it being available in the default
shipped kernel - so that users can use it to troubleshoot problems "on
the fly" without having to recompile and install a new kernel (involves
rebooting).

AFAIK (not involved with its development), most of dtrace can reside in
a kernel module but some parts need to be in the kernel proper to
support this mode of operation, and *this* is where the licensing comes
in. Just a few files (AFAIK: mostly header files!) need to be
dual-licensed so they can be included in the default kernel build, and
the rest can be in the CDDL licensed kernel module.


signature.asc

Hartmut Brandt

unread,
Apr 10, 2007, 5:25:47 AM4/10/07
to


I had some discussion with folks at Sun (indirectly via another guy)
while they were in the process of making the CDDL. They said:
modifications to CDDL code must be under the CDDL. This means if you change
a CDDLed file, your changes are CDDL. If you add a line to the CDDL code
that calls a function in another, new file, you're free to put that
other file under any license as long as there is compatibility the
other way 'round - you probably cannot put that file under the GPL, but you
can put it under BSD. The new file is not a modification of the CDDLed code.

harti

Kris Kennaway

unread,
Apr 10, 2007, 2:58:21 PM4/10/07
to

Some of these are already known, at least. Also please try to
recreate the deadlock.

Thanks,
Kris

David Schultz

unread,
Apr 11, 2007, 5:49:11 PM4/11/07
to
On Sat, Apr 07, 2007, Dag-Erling Smørgrav wrote:
> Bernd Walter <ti...@cicely12.cicely.de> writes:
> > On Sat, Apr 07, 2007 at 09:43:59PM +0200, Dag-Erling Smørgrav wrote:
> > > ZFS is now also available on pc98 and amd64.
> > Great to read - is it just atomic.S missing for the remaining
> > architectures?
>
> Yes. Ideally, ZFS would use FreeBSD's atomic operations instead of
> its own. I believe that the reason it doesn't is (at least in part)
> that we don't have 64-bit atomic operations for i386. I have
> unfinished patches for cleaning up the atomic operations on all
> platforms; I'll dust them off and see what I can do.

As I recall, Solaris 10 targets PPro and later processors, whereas
FreeBSD supports everything back to a 486DX. Hence we can't
assume that cmpxchg8b is available. The last time I remember this
coming up, people argued that we had to do things the slow way in the
default kernel for compatibility.

Any ideas how ZFS and GEOM are going to work out, given that ZFS
is designed to be the filesystem + volume manager in one?

Anyway, this looks like awesome stuff! Unfortunately, I won't have
any time to play with it much in the short term, but as soon as WD
sends me the replacement for my spare disk I'll at least install
ZFS and see how it goes.

Awesome work, once again. Thanks!

Bernd Walter

unread,
Apr 11, 2007, 6:51:25 PM4/11/07
to
On Wed, Apr 11, 2007 at 05:49:11PM -0400, David Schultz wrote:
> On Sat, Apr 07, 2007, Dag-Erling Smørgrav wrote:
> > Bernd Walter <ti...@cicely12.cicely.de> writes:
> > > On Sat, Apr 07, 2007 at 09:43:59PM +0200, Dag-Erling Smørgrav wrote:
> > > > ZFS is now also available on pc98 and amd64.
> > > Great to read - is it just atomic.S missing for the remaining
> > > architectures?
> >
> > Yes. Ideally, ZFS would use FreeBSD's atomic operations instead of
> > its own. I believe that the reason it doesn't is (at least in part)
> > that we don't have 64-bit atomic operations for i386. I have
> > unfinished patches for cleaning up the atomic operations on all
> > platforms; I'll dust them off and see what I can do.

I already did a good cleanup of arm atomic functions based on your
work a while ago.

> As I recall, Solaris 10 targets PPro and later processors, whereas
> FreeBSD supports everything back to a 486DX. Hence we can't
> assume that cmpxchg8b is available. The last time I remember this
> > coming up, people argued that we had to do things the slow way in the
> default kernel for compatibility.

486 support is definitely needed, but it is very unlikely that many
real existing 486 systems have enough RAM for ZFS.
AFAIK an ELAN520 can have up to 256MB, but I doubt that one would
spend so much RAM on such a system without a better use for it.
Not sure about 586; that is more likely.
But I'm not very familiar with x86 assembly, so I don't even know which
CPUs have cmpxchg8b.
If ZFS weren't so greedy I might have used it on flash media for
x86 and ARM systems, but those boards usually don't have enough RAM.

> Any ideas how ZFS and GEOM are going to work out, given that ZFS
> is designed to be the filesystem + volume manager in one?

Even if you want to use ZFS's RAID functionality, GEOM still has many
goodies available, such as md, ggate, partition parsing, encryption, etc.
There are other cool things which I've found possible lately.
E.g. replace all RAIDZ drives with bigger ones, export/import the
pool, and you have additional storage with the same number of drives.
You just need a single additional drive at any one time, which is
great in case you are short on drive bays.
In case you accidentally added a drive you didn't want to, you can't
easily remove it, but you can work around that by replacing it with
another one which is equal or bigger in size.
As a short-term workaround in such a case, until you can backup/restore
or replace the wrong drive with a permanent one, you can use sparse
md-vnode devices, or ggate or gconcat ones.
You just have to be careful with sparse files, since ZFS doesn't care
about them when filling with data, but you can at least detach your USB
or FireWire drive and hopefully live with the situation for a few days.
Today I tested a 6T volume with sparse md files.
This all worked really great.

--
B.Walter http://www.bwct.de http://www.fizon.de
be...@bwct.de in...@bwct.de sup...@fizon.de
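
Roughly what the sparse-file experiment described above looks like; paths,
sizes and device names are invented, and the backing store must of course be
able to hold whatever ZFS actually writes:

    # Sparse backing files, attached as md vnode devices (md0, md1, md2).
    truncate -s 2T /storage/disk0 /storage/disk1 /storage/disk2
    mdconfig -a -t vnode -f /storage/disk0
    mdconfig -a -t vnode -f /storage/disk1
    mdconfig -a -t vnode -f /storage/disk2
    # Build a test RAIDZ pool on top of them.
    zpool create testpool raidz md0 md1 md2
    # A mistakenly added or temporary disk can later be swapped for an
    # equal-or-bigger replacement (e.g. a real disk).
    zpool replace testpool md2 da0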



Bernd Walter

unread,
Apr 11, 2007, 10:12:52 PM4/11/07
to
On Wed, Apr 11, 2007 at 04:10:46PM -0700, Louis Kowolowski wrote:
> On Thu, Apr 12, 2007 at 12:51:25AM +0200, Bernd Walter wrote:
> ...

> > 486 support is definitely needed, but it is very unlikely that many
> > real existing 486 systems have enough RAM for ZFS.
> > AFAIK an ELAN520 can have up to 256MB, but I doubt that one would
> > spend so much RAM on such a system without a better use for it.
> > Not sure about 586; that is more likely.
> > But I'm not very familiar with x86 assembly, so I don't even know which
> > CPUs have cmpxchg8b.
> > If ZFS wouldn't be so greedy I might have used it on flash media for
> > x86 and ARM systems, but those boards usually don't have enough RAM.
> >
> I'm sure some people would be interested in being able to use ZFS with boxes like
> Soekris for NAS (FreeNAS comes to mind) type stuff...

I'm currently running an NFS fileserver with 384M RAM, which seems to
work with some restrictions, but it is also putting pressure on the CPU,
which is a 700MHz PIII and this is not only while accessing compressed
data.
You might be able to get it running on a 256MB 4801, but don't expect
any speed wonders.
The upcoming 5501 might be a good candidate if populated with much RAM.
If I got the prototype picture on soekris.com right they have 512MBit
chips soldered, which gives 256MB only - more than enough for most
embedded use, but not with ZFS as it stands right now...
That said - I don't know what the default population really will be.

Wilkinson, Alex

unread,
Apr 11, 2007, 10:57:46 PM4/11/07
to
0n Thu, Apr 12, 2007 at 12:51:25AM +0200, Bernd Walter wrote:

partition-parsing ? got any info on that ? I have never heard of it.

-aW


Peter Jeremy

unread,
Apr 12, 2007, 3:36:06 AM4/12/07
to
On 2007-Apr-11 17:49:11 -0400, David Schultz <d...@freebsd.org> wrote:
>As I recall, Solaris 10 targets PPro and later processors, whereas
>FreeBSD supports everything back to a 486DX. Hence we can't
>assume that cmpxchg8b is available.

There's a feature bit (CPUID_CX8) that advertises the availability of
cmpxchg8b (and maybe some related instructions). My pre-MMX 586 has
this bit set so I presume anything later than 486 will support it.
(I'm not sure about the low-end VIA, GEODE etc clones).

> The last time I remember this
>coming up, people argued that we had to do things the slow way in the
>default kernel for compatibility.

I agree that GENERIC should run on lowest-common-denominator hardware
(the definition of that is a subject for a different thread). GENERIC
performance could be enhanced by using an indirect call for 8-byte
atomic instructions and selecting between the cmpxchg8b and
alternative implementation as part of the CPU startup (much like
i586_bcopy). If CPU_486 is not defined, your code could inline the
cmpxchg8b-based variant.

--
Peter Jeremy
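
A quick way to check a particular box, assuming the saved boot messages are
still around:

    # CX8 in the feature list means cmpxchg8b is implemented.
    grep Features /var/run/dmesg.boot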

Dag-Erling Smørgrav

unread,
Apr 12, 2007, 5:39:35 AM4/12/07
to
Louis Kowolowski <lou...@cryptomonkeys.com> writes:
> I'm sure some people would be interested in being able to use ZFS with boxes like
> Soekris for NAS (FreeNAS comes to mind) type stuff...

I don't think a Soekris will cut the mustard. A NAS would need a
large case to hold the disks anyway, so you might as well use an EPIA
board; most C3 / C7 boards can take 1 GB, and they don't cost more
than a Soekris.

DES
--
Dag-Erling Smørgrav - d...@des.no

Oliver Fromme

unread,
Apr 12, 2007, 5:58:17 AM4/12/07
to
Dag-Erling Smørgrav wrote:

> Peter Jeremy wrote:
> > There's a feature bit (CPUID_CX8) that advertises the availability of
> > cmpxchg8b (and maybe some related instructions). My pre-MMX 586 has
> > this bit set so I presume anything later than 486 will support it.
> > (I'm not sure about the low-end VIA, GEODE etc clones).
>
> The Geode is a 486, and does not support it.

No, it's a 586-class processor. But you're right in
that it does not seem to support cmpxchg8b. I have an
old 233 MHz Geode currently running FreeBSD 4.6 (please
no comments, it's my standalone mp3 player at home and
not connected to the internet so I didn't care to update
it yet, but I certainly will update it when I have some
time). The kernel reports:

CPU: Cyrix GXm (232.74-MHz 586-class CPU)
Origin = "CyrixInstead" Id = 0x540 DIR=0x8246 Stepping=8 Revision=2

There's no "Features=" line, though. Maybe the Geode
does not support the cpuid at all. Whether it supports
cmpxchg8b is not 100% clear, but my guess would be "no".

> The C3 however is a 586.

In fact it's a 686.

> The C3 Ezra and C3 Samuel / Samuel 2 do not have CX8.
> I'm not sure about the C3 Nehemiah, I don't have one
> running at the moment.

I have a 1000 MHz C3 Nehemiah which is my home file server
(NFS and SMB), among other things (Squid, Apache, FW).
It does not support cmpxchg8b either, according to the
cpuid feature bits:

CPU: VIA C3 Nehemiah+RNG+AES (1002.28-MHz 686-class CPU)
Origin = "CentaurHauls" Id = 0x698 Stepping = 8
Features=0x381b83f<FPU,VME,DE,PSE,TSC,MSR,SEP,MTRR,PGE,CMOV,PAT,MMX,FXSR,SSE>

It's currently running 6-stable, but I would very much
like to update it to -current and use ZFS for the file
server volumes. I hope the absence of cmpxchg8b won't
make that impossible.

(It has 512 MB RAM, which should be sufficient to run
ZFS, right? The squid process also takes quite some
memory, but I've configured it to be rather small.
After all this is only a private home server. I'm not
planning to use compression, but maybe encryption (GELI)
for a small part of it.)

> > I agree that GENERIC should run on lowest-common-denominator hardware
> > (the definition of that is a subject for a different thread). GENERIC
> > performance could be enhanced by using an indirect call for 8-byte
> > atomic instructions and selecting between the cmpxchg8b and
> > alternative implementation as part of the CPU startup (much like
> > i586_bcopy). If CPU_486 is not defined, your code could inline the
> > cmpxchg8b-based variant.

That wouldn't work on the C3 Nehemiah, I'm afraid. CPU_486
is not defined there (in fact I only have I686_CPU in my
kernel config), but it does not support cmpxchg8b according
to the dmesg output above. So the CPU class alone is not
sufficient to decide about the use of cmpxchg8b; you have
to check the actual CPU Features bit.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

C++: "an octopus made by nailing extra legs onto a dog"
-- Steve Taylor, 1998

Bernd Walter

unread,
Apr 12, 2007, 7:31:39 AM4/12/07
to
On Thu, Apr 12, 2007 at 10:57:46AM +0800, Wilkinson, Alex wrote:
> 0n Thu, Apr 12, 2007 at 12:51:25AM +0200, Bernd Walter wrote:
> >On Wed, Apr 11, 2007 at 05:49:11PM -0400, David Schultz wrote:
> >> On Sat, Apr 07, 2007, Dag-Erling Smørgrav wrote:
> >> > Bernd Walter <ti...@cicely12.cicely.de> writes:
> >> > > On Sat, Apr 07, 2007 at 09:43:59PM +0200, Dag-Erling Smørgrav wrote:
> >
> >Although you want to use ZFS RAID functionality GEOM has still many
> >goodies avalable, such as md, ggate, partition-parsing, encyption, etc.
> >There are other cool points, which I've found possible lately.
>
> partition-parsing ? got any info on that ? I have never heard of it.

Well - you have multiple ways to partition your drives:
bsdlabel, sunlabel, fdisk, gpt, ...
GEOM has classes which detect and parse them all.
It is responsible for giving you the /dev/*s1a entries and so on.

Takeshi Ken Yamada

unread,
Apr 12, 2007, 11:41:51 PM4/12/07
to
Great work!
It works without any problems so far with my Opteron(dual core)X2
-current box.

Are there any I/O performance comparison data with UFS, even rough ones?

Pawel Jakub Dawidek

unread,
Apr 13, 2007, 6:00:06 AM4/13/07
to
On Fri, Apr 13, 2007 at 12:41:51PM +0900, Takeshi Ken Yamada wrote:
> Great work!
> It works without any problems so far with my Opteron(dual core)X2
> -current box.
>
> Are there any I/O performance comparison data with UFS, even rough ones?

There are some numbers in my AsiaBSDCon paper, but ZFS has been optimized
quite a bit in several areas since then.

Craig Boston

unread,
Apr 13, 2007, 10:35:50 AM4/13/07
to
On Fri, Apr 13, 2007 at 05:34:56PM +1000, Bruce Evans wrote:
> Doesn't everyone who uses atomic operations knows that they are expensive?
> :)

Yes, though hopefully they should at least be faster than using a
mutex, though for cmpxchg8b it sounds like that may not necessarily be
the case...

Craig

Robert Watson

unread,
Apr 13, 2007, 10:47:52 AM4/13/07
to

On Fri, 13 Apr 2007, Craig Boston wrote:

> On Fri, Apr 13, 2007 at 05:34:56PM +1000, Bruce Evans wrote:
>
>> Doesn't everyone who uses atomic operations knows that they are expensive?
>> :)
>
> Yes, though hopefully they should at least be faster than using a mutex,
> though for cmpxchg8b it sounds like that may not necessarily be the case...

A common example of this not being the case is statistics updates: it doesn't
take too many statistics being updated at once before it makes more sense to
use a mutex than individual atomic instructions, as mutex lock and unlock, in
the uncontended case, involve an atomic instruction each (with memory
barriers). Then it becomes more semantic: is using non-blocking primitives
preferable, or are there consistency requirements between "atomically" updated
fields? If contention never happens, then maybe you get consistency for free
by using a mutex.

As a general rule, unless it's a very clear-cut case (a simple counter), I
would encourage people to program with mutexes rather than directly with
atomic instructions. It prevents them from having to deal with really weird
stuff that happens with weaker memory consistency.

Robert N M Watson
Computer Laboratory
University of Cambridge



Sergey Zaharchenko

unread,
Apr 13, 2007, 11:29:32 AM4/13/07
to
[cc list trimmed]

Hello Oliver!

Fri, Apr 13, 2007 at 04:52:45PM +0200 you wrote:

> Using cmpxchg8b with a lock prefix wouldn't be a good idea
> anyway. If I remember correctly, the lock cmpxchg8b
> combination was the cause of the infamous "F00F" bug of
> old Pentium processors. It causes them to freeze.

AFAICT the bug only manifested itself when the instruction had an
invalid register operand:

www.intel.com/support/processors/pentium/ppiie/ :

> It is illegal to use a register as the destination. ... If a
> register is used as the destination, the processor normally stops
> execution of the CMPXCH8B instruction, signals this error
> condition and executes an error handler in software.
> This erratum occurs if the CMPXCHG8B instruction is also locked ...
> and an invalid register destination is used.

So normal instructions should be OK. The fix was there to protect the
system from malicious code which could hang it.

--
DoubleF
No virus detected in this message. Ehrm, wait a minute...
/kernel: pid 56921 (antivirus), uid 32000: exited on signal 9
Oh yes, no virus:)

Dag-Erling Smørgrav

unread,
Apr 13, 2007, 11:59:52 AM4/13/07
to
Andrew Reilly <andrew-...@areilly.bpc-users.org> writes:
> Apart from the fact that you are correct, how long is the
> instruction encoding of cmpxchg8?

Three bytes (0F C7 m64), four for "lock cmpxchg8" (F0 0F C7 m64). If
the top two bits of m64 are set, you may get "interesting" results :)

DES
--
Dag-Erling Smørgrav - d...@des.no

Dag-Erling Smørgrav

unread,
Apr 13, 2007, 12:16:40 PM4/13/07
to
Oliver Fromme <ol...@lurza.secnetix.de> writes:
> Using cmpxchg8b with a lock prefix wouldn't be a good idea anyway.
> If I remember correctly, the lock cmpxchg8b combination was the
> cause of the infamous "F00F" bug of old Pentium processors. It
> causes them to freeze.

Only when the operand is invalid. This causes an invalid opcode
exception which cannot be handled because the memory bus is locked,
preventing the handler from being loaded into cache.

> (FreeBSD has a hack to work around the problem, as you certainly
> know ... I don't know exactly how it works.)

The workaround marks the interrupt descriptor table read-only, so the
invalid opcode exception triggers a page fault, which unlocks the bus. The
page fault handler examines the state of the CPU, determines that an
invalid opcode exception occurred, and passes control to the
appropriate handler (which sends SIGILL to the offending process).

Additionally, to avoid penalizing other exceptions, the IDT is aligned
such that it crosses a page boundary immediately after the entry for
the invalid opcode exception, so only the first six entries in the IDT
need to be read-only.

DES
--
Dag-Erling Smørgrav - d...@des.no

Dag-Erling Smørgrav

unread,
Apr 13, 2007, 2:13:02 PM4/13/07
to
Oliver Fromme <ol...@lurza.secnetix.de> writes:
> Just a quick question: Does ZFS still work reliable when the write
> cache for ATA disks is enabled, i.e. with the line "hw.ata.wc=1" in
> /boot/loader.conf?

Yes, as long as the disk doesn't lie about flushing its cache. Some
early ATA disks faked the FLUSHCACHE command to improve their
benchmark results. I don't know if any still do.
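
For completeness, the knob in question is a loader tunable; a /boot/loader.conf
sketch (the value shown keeps the write cache enabled, which should already be
the default on recent releases):

    # ATA write caching; ZFS issues explicit cache flushes, so this is
    # considered safe as long as the drive honours the flush command.
    hw.ata.wc="1"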

Ulrich Spoerlein

unread,
Apr 13, 2007, 2:44:25 PM4/13/07
to
Rick C. Petty wrote:
> On Thu, Apr 12, 2007 at 01:51:59PM -0500, Craig Boston wrote:
> > For something this low level my opinion is it's better to stay with
> > compile time options. After all, in the above example, cmpxchg8 is a
> > single machine instruction. How much overhead does it add to retrieve a
> > variable from memory and check it, then jump to the correct place?
> > Enough that it outweighs the benefit of using that instruction in the
> > first place?
>
> [...]
> The problem is that ZFS would be compiled (by default) to work for many
> platforms, and thus a majority of systems wouldn't get the nice
> optimization.

Disclaimer: I have no clue what cmpxchg8 actually does, but ...

We are talking about optimizing a filesystem by speeding up the
necessary CPU computations. Now, whenever the CPU waits for I/O (which
the ZFS threads will do plenty of times) it has literally thousands of
cycles to burn.

I don't see how this could possibly make ZFS any faster if it does not
avoid I/O operations entirely.

Ulrich Spoerlein
--
"The trouble with the dictionary is you have to know how the word is
spelled before you can look it up to see how it is spelled."
-- Will Cuppy

Oliver Fromme

unread,
Apr 13, 2007, 12:26:17 PM4/13/07
to

Sergey Zaharchenko wrote:
> Hello Oliver!
>
> Fri, Apr 13, 2007 at 04:52:45PM +0200 you wrote:
> > Using cmpxchg8b with a lock prefix wouldn't be a good idea
> > anyway. If I remember correctly, the lock cmpxchg8b
> > combination was the cause of the infamous "F00F" bug of
> > old Pentium processors. It causes them to freeze.
>
> AFAICT the bug only manifested itself when the instruction had an
> invalid register operand:
>
> www.intel.com/support/processors/pentium/ppiie/ :

Ah, that's good then. Thanks for the clarification!

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"And believe me, as a C++ programmer, I don't hesitate to question
the decisions of language designers. After a decent amount of C++
exposure, Python's flaws seem ridiculously small." -- Ville Vainio

Bjoern A. Zeeb

unread,
Apr 30, 2007, 5:55:59 AM4/30/07
to

The two new ones got added to 'The LOR page':

>lock order reversal:
> 1st 0xc2be9154 zfs:&db->db_mtx (zfs:&db->db_mtx) @
> sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dnode.c:318
> 2nd 0xc2c94b20 zfs:&zp->z_lock (zfs:&zp->z_lock) @
> sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/zfs_znode.c:73

LOR ID #209
http://sources.zabbadoz.net/freebsd/lor.html#209

>lock order reversal:
> 1st 0xc2c3e818 zfs:&ds->ds_deadlist.bpl_lock (zfs:&ds->ds_deadlist.bpl_lock) @
> sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/bplist.c:154
> 2nd 0xc2be63a0 zfs:&dn->dn_struct_rwlock (zfs:&dn->dn_struct_rwlock) @
> sys/modules/zfs/../../contrib/opensolaris/uts/common/fs/zfs/dnode.c:571

LOR ID #210
http://sources.zabbadoz.net/freebsd/lor.html#210

--
Bjoern A. Zeeb bzeeb at Zabbadoz dot NeT

Pawel Jakub Dawidek

unread,
Apr 30, 2007, 5:30:43 PM4/30/07
to
On Mon, Apr 30, 2007 at 11:28:22PM +0200, Peter Schuller wrote:
> > Hi, just wanted to chime in that I'm experiencing the same panic with
> > a fresh -CURRENT.
>
> I am also/still seeing the "kmem_map too small" panic on a tree cvsup:ed
> around April 27.
>
> I can consistently trigger it by doing "rsync -a /usr/ports
> /somepool/somepath", with both /usr and /somepool being on ZFS
> (different pools). This is on a machine with 1 GB memory, with the kmem
> size being 320 MB as per default.
>
> The kstat.zfs.misc.arcstats.size never jumps; the "solaris" memory usage
> never significantly jumps - stays between about 80 MB and 150 MB at all
> times. It does not even consistently increase in size within this range
> - it goes up and down.
>
> In terms of absolute sizes, nothing in the output of vmstat -m, except
> solaris, is even approaching the sizes we are talking about (everything
> is a handful of megs at most).
>
> Watching "top" during the rsync I can see wired memory steadily
> increasing. Starting at about 110 megs or so after startup (including
> parts of my desktop), it begins consistently increasing when I run the
> rsync. In this case I started to approach 200 megs. When rsync was done
> (ran it with -v this time) reading the source directory and began
> copying files, the growth of wired memory increased significantly in
> speed (it was up to 280 MB or so in under 30 seconds).
>
> Killing rsync did not cause the wired total to go down.
>
> Any suggestions on whether there is something else to monitor to find
> out what is using all the memory?
>
> zfs_kmem_alloc() always allocates with M_SOLARIS.
> kmem_cache_{create,alloc} don't, but they seem to be allocating very
> small amounts of memory (could there be leakage of a huge number of
> these?). Is it expected that ZFS would allocate significant amount of
> memory that is not categorized as M_SOLARIS?
>
> Could there be fragmentation going on? Are there very large allocations,
> relative to the 320 MB kmem size, intermixed with small allocations?
>
> Anything I can do in terms of testing that would help debug this, beyond
> what has already been done and reported on on -current?

What you're seeing is probably another problem, which was described
already. Try tuning kern.maxvnodes down to 3/4 of the current value,
see if that helps and please report back.
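
In shell terms the suggested tuning looks something like the following
(a one-off change; a permanent value would go into /etc/sysctl.conf):

    # Note the current limit, then drop it to roughly three quarters of it.
    sysctl kern.maxvnodes
    sysctl kern.maxvnodes=$(( $(sysctl -n kern.maxvnodes) * 3 / 4 ))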

Peter Schuller

unread,
Apr 30, 2007, 5:56:02 PM4/30/07
to
> What you're seeing is probably another problem, which was described
> already.

My apologies. I see this was mentioned on -fs (but for some reason
doesn't show up in Google). I'll subscribe to that and remember to check
the archive before generating more noise in the future.

> Try tuning kern.maxvnodes down to 3/4 of the current
> value, see if that helps and please report back.

This does seem to eliminate the problem here too.

Again, apologies for the noise, and thank you very much.

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.s...@infidyne.com>'
Key retrieval: Send an E-Mail to getp...@scode.org
E-Mail: peter.s...@infidyne.com Web: http://www.scode.org


signature.asc

Peter Schuller

unread,
May 1, 2007, 9:12:13 AM5/1/07
to
> This does seem to eliminate the problem here too.

It appears the problem persists, but is more difficult to trigger.

I had a reboot again during building of ports. I decreased maxvnodes
further (to about 2/3 of the default, instead of the recommended 3/4 of
the default). Even after that, I had another reboot just now, also
during building of ports.

It takes on the order of several hours to trigger it.

Note that I say "reboot" because that's what it was; it appears that
when this happens while in X (which I suspect is the relevant
difference) the machine reboots immediately, while at the console I
get the kernel debugger.

As a result I cannot say for certain that I am still seeing the same
problem; it is only an assumption at this point.

Because swap is on a glabel I have crashdumps turned off, as I was not
sure whether it was safe (i.e., I don't want crashdumps to accidentally
write to the wrong partition).

signature.asc

Kris Kennaway

unread,
May 1, 2007, 12:02:13 PM5/1/07
to
On Tue, May 01, 2007 at 10:41:10AM -0400, Rick Macklem wrote:

>
>
> On Tue, 1 May 2007, Peter Schuller wrote:
>
> >>This does seem to eliminate the problem here too.
> >
> >It appears the problem persists, but is more difficult to trigger.
> [stuff snipped]

> >It takes on the order of several hours to trigger it.
>
> I don't know if it relevent, but I've seen "kmem_map: too small" panics
> when testing my NFSv4 server, ever since about FreeBSD5.4. There is no
> problem running the same server code on FreeBSD4 (which is what I still
> run in production mode) or OpenBSD3 or 4. If I increase the size of the
> map, I can delay the panic for up to about two weeks of hard testing,
> but it never goes away. I don't see any evidence of a memory leak during
> the several days of testing leading up to the panic. (NFSv4 uses
> MALLOC/FREE extensively for state related structures.)

Sounds exactly like a memory leak to me. How did you rule it out?

> So, I'm wondering if maybe there is some subtle bug in MALLOC/FREE (maybe
> i386 specific, since that's what I test on)?

That would be unlikely.

Kris

Kris Kennaway

unread,
May 1, 2007, 6:20:16 PM5/1/07
to
On Tue, May 01, 2007 at 04:39:09PM -0400, Rick Macklem wrote:

>
>
> On Tue, 1 May 2007, Kris Kennaway wrote:
> >>I don't know if it relevent, but I've seen "kmem_map: too small" panics
> >>when testing my NFSv4 server, ever since about FreeBSD5.4. There is no
> >>problem running the same server code on FreeBSD4 (which is what I still
> >>run in production mode) or OpenBSD3 or 4. If I increase the size of the
> >>map, I can delay the panic for up to about two weeks of hard testing,
> >>but it never goes away. I don't see any evidence of a memory leak during
> >>the several days of testing leading up to the panic. (NFSv4 uses
> >>MALLOC/FREE extensively for state related structures.)
> >
> >Sounds exactly like a memory leak to me. How did you rule it out?
> Well, I had a little program running on the server that grabbed the
> mti_stats[] out of the kernel and logged them. I had one client mounted
> running thousands of passes of the Connectathon basic tests (one client,
> same activity over and over and over again). For a week, the stats don't
> show any increase in allocation for any type (alloc - free doesn't get
> unreasonably big), then..."panic: kmem_map too small". How many days it
> took to happen would vary with the size of the kernel map, but no evidence
> of a leak prior to the crash. It seemed to be based on the number of times
> MALLOC and FREE were called.

Or something else is leaking. Really, if there was a problem with
MALLOC/FREE we'd see it.

Kris

Robert Watson

unread,
May 2, 2007, 10:53:50 AM5/2/07
to
On Tue, 1 May 2007, Rick Macklem wrote:

> On Tue, 1 May 2007, Kris Kennaway wrote:
>
>>> I don't know if it relevent, but I've seen "kmem_map: too small" panics
>>> when testing my NFSv4 server, ever since about FreeBSD5.4. There is no
>>> problem running the same server code on FreeBSD4 (which is what I still
>>> run in production mode) or OpenBSD3 or 4. If I increase the size of the
>>> map, I can delay the panic for up to about two weeks of hard testing, but
>>> it never goes away. I don't see any evidence of a memory leak during the
>>> several days of testing leading up to the panic. (NFSv4 uses MALLOC/FREE
>>> extensively for state related structures.)
>>
>> Sounds exactly like a memory leak to me. How did you rule it out?

> Well, I had a little program running on the server that grabbed the
> mti_stats[] out of the kernel and logged them. I had one client mounted
> running thousands of passes of the Connectathon basic tests (one client,
> same activity over and over and over again). For a week, the stats don't
> show any increase in allocation for any type (alloc - free doesn't get
> unreasonably big), then..."panic: kmem_map too small". How many days it took
> to happen would vary with the size of the kernel map, but no evidence of a
> leak prior to the crash. It seemed to be based on the number of times MALLOC
> and FREE were called.
>

> Also, the same server code (except for the port changes, which have nothing
> to do with the state handling where MALLOC/FREE get called a lot), works
> fine for months on FreeBSD4 and OpenBSD3.9.
>
> So, I won't say a "memory leak is ruled out", but if there was a leak why
> wouldn't it bite FreeBSD4 or show up in mti_stats[]?
>
> I first saw it on FreeBSD6.0, but went back to FreeBSD5.4 and tried the same
> test and got the same result.

Historically, such panics have been a result of one of two things:

(1) An immediate resource leak in UMA(9) or malloc(9) allocated memory.

(2) Mis-tuning of a resource limit, perhaps due to sizing the limit based on
solely physical memory size, not taking available kernel address space
into account.

mti_stats reports only on malloc(9), you need to also look at uma(9), since
many frequently allocated types are allocated directly with the slab
allocator, and not from kernel malloc. Take a look at the output of "show
uma" or "show malloc" in DDB, or respectively "vmstat -z" and "vmstat -m" on a
core or on a live system. malloc(9) is actually implemented using two
different back-ends: UMA-managed fixed size memory buckets for small
allocations, and direct page allocation for large allocations.

The most frequent example of (2) is mis-tuning in the maximum vnode limit of
the system, resulting in the vnode cache exceeding available address space.
Try tuning down that limit. Notice that vnodes, inodes, and most frequently
used file system allocation data types are allocated using uma(9) and not
malloc(9).

Robert N M Watson
Computer Laboratory
University of Cambridge
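
Concretely, the zones and limits mentioned above can be eyeballed like this
(zone names as printed by vmstat on a recent -CURRENT; column layout may vary):

    # UMA zones backing vnodes and pathname (namei) buffers.
    vmstat -z | egrep 'ITEM|VNODE|NAMEI'
    # The vnode limit and the current counts, to spot runaway growth.
    sysctl kern.maxvnodes vfs.numvnodes vfs.freevnodes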

Kostik Belousov

unread,
May 3, 2007, 12:09:07 AM5/3/07
to
On Wed, May 02, 2007 at 05:28:04PM -0400, Rick Macklem wrote:
>
>
> On Wed, 2 May 2007, Robert Watson wrote:
> [stuff snipped]
> >
> >Historically, such panics have been a result of one of two things:
> >
> >(1) An immediate resource leak in UMA(9) or malloc(9) allocated memory.
> >
> >(2) Mis-tuning of a resource limit, perhaps due to sizing the limit based
> >on
> > solely physical memory size, not taking available kernel address space
> > into account.
> >
> >mti_stats reports only on malloc(9), you need to also look at uma(9),
> >since many frequently allocated types are allocated directly with the slab
> >allocator, and not from kernel malloc. Take a look at the output of "show
> >uma" or "show malloc" in DDB, or respectively "vmstat -z" and "vmstat -m"
> >on a core or on a live system. malloc(9) is actually implemented using
> >two different back-ends: UMA-managed fixed size memory buckets for small
> >allocations, and direct page allocation for large allocations.
>
> Ok, it does appear I'm leaking NAMEIs. "vmstat -z", which I didn't know
> about, was the trick. Handling lookup name buffers is also port specific,
> so it wouldn't have shown up in the other ports.
>
> So, forget what I said w.r.t. a MALLOC bug and thanks for the help. I
> should be able to locate the leak pretty easily with "vmstat -z".
I fixed two NAMEI zone leaks in the last 2-3 months. One was in the NFS
server (the fix shall be present in 6.2-RELEASE, AFAIR); the second was in
the UFS snapshotting code and was MFCed several days ago.

Dag-Erling Smørgrav

unread,
May 21, 2007, 12:24:29 PM5/21/07
to
Vince <jh...@unsane.co.uk> writes:
> I don't suppose that there are any other tunables people could suggest?

sysctl kern.maxvnodes=50000

Also, disabling atime on all ZFS file systems will greatly improve
performance and reduce the frequency of ATA stalls.

DES
--
Dag-Erling Smørgrav - d...@des.no
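
The atime change is a per-dataset property; setting it on the top level lets
the children inherit it (the pool name below is made up):

    # Stop updating access times on every read.
    zfs set atime=off tank
    zfs get -r atime tank    # check what each dataset ended up with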


Darren Reed

unread,
May 22, 2007, 4:31:12 AM5/22/07
to
On Mon, May 21, 2007 at 04:40:31PM +0100, Vince wrote:
...
> I don't suppose that there are any other tunables people could suggest? I
> got a shiny new (well, old but new to me) dual Opteron board and dual 250
> SATA drives and thought I'd try putting it in as my home server with
> everything but / on ZFS, since I've had my /usr/ports on my laptop as
> compressed ZFS since very shortly after it was committed.
> After a few "kmem_map: too small" panics I re-read this thread and put
> vm.kmem_size_max and vm.kmem_size up to 512M and vfs.zfs.arc_min and
> vfs.zfs.arc_max down to 65 megs. This did get me past "portsnap extract"
> but a make buildworld still got me the same panic. vmstat -z showed a
> steady growth. This is with a generic -CURRENT from Friday. I'm happy to
> provide any useful information once I get home and reboot it.

Are you running the opterons with a 32 or 64 bit kernel?

I set vfs.zfs.arc_max to somewhere between 75% and 80% of vm.kmem_size_max.

Darren

Solon Luigi Lutz

unread,
May 22, 2007, 10:59:43 AM5/22/07
to
KVS> Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
>>> I may reinstall at a later date as this is still very much a box to play
>>> with, but I gather there is no great gain from going 64 bit other than
>>> not having to play with PAE if you've got lots of RAM.
>>
>> I expect there is a huge difference in performance between i386 and
>> amd64. I'm currently setting up environment to compare ZFS on
>> FreeBSD/i386, FreeBSD/amd64 and Solaris.

KVS> I've had precious little time to do more testing on our amd64-setup, but
KVS> it seems that vm.kmem_size_max is a 32-bit uint, so we can't really use
KVS> much RAM for ZFS.

These are the figures for the following out-of-the-box 7.0 amd64 smp system:

Athlon X2 3800, 1GB Ram, Asus M2N-SLI (modified), 24x 500GB (Samsung Spinpoint),
Areca ARC1280 running RAID-6.

ZFS (8000 MB test size):
  Sequential Output, Per Char:   89555 K/sec  (85.6% CPU)
  Sequential Output, Block:     188475 K/sec  (63.3% CPU)
  Sequential Output, Rewrite:   114048 K/sec  (36.9% CPU)
  Sequential Input, Per Char:    95865 K/sec  (97.2% CPU)
  Sequential Input, Block:      460375 K/sec  (64.0% CPU)
  Random Seeks:                    41.2 /sec  ( 0.3% CPU)

Test was done on a 10TB volume.

Since the machine is running in semi-production mode, I can't perform any UFS
tests anymore. The system is running stable, and extensive
checksumming after 8TB of data transfers didn't reveal any errors;
neither did a filesystem stress test. Checksumming with 'cfv' runs at a
sustained data rate of 290MB/s.

Thanks for the good work!

Pawel Jakub Dawidek

unread,
May 22, 2007, 12:39:20 PM5/22/07
to
On Tue, May 22, 2007 at 04:59:43PM +0200, Solon Luigi Lutz wrote:
> KVS> Pawel Jakub Dawidek <p...@FreeBSD.org> writes:
> >>> I may reinstall at a later date as this is still very much a box to play
> >>> with, but I gather there is no great gain from going 64 bit other than
> >>> not having to play with PAE if you've got lots of RAM.
> >>
> >> I expect there is a huge difference in performance between i386 and
> >> amd64. I'm currently setting up environment to compare ZFS on
> >> FreeBSD/i386, FreeBSD/amd64 and Solaris.
>
> KVS> I've had precious little time to do more testing on our amd64-setup, but
> KVS> it seems that vm.kmem_size_max is a 32-bit uint, so we can't really use
> KVS> much RAM for ZFS.
>
> These are the figures for the following out-of-the-box 7.0 amd64 smp system:

out-of-the-box means that you still have INVARIANTS/WITNESS turned on?

> Athlon X2 3800, 1GB Ram, Asus M2N-SLI (modified), 24x 500GB (Samsung Spinpoint),
> Areca ARC1280 running RAID-6.
>
> ZFS (8000 MB test size):
>   Sequential Output, Per Char:   89555 K/sec  (85.6% CPU)
>   Sequential Output, Block:     188475 K/sec  (63.3% CPU)
>   Sequential Output, Rewrite:   114048 K/sec  (36.9% CPU)
>   Sequential Input, Per Char:    95865 K/sec  (97.2% CPU)
>   Sequential Input, Block:      460375 K/sec  (64.0% CPU)
>   Random Seeks:                    41.2 /sec  ( 0.3% CPU)
>
> Test was done on a 10TB volume.
>
> Since the machine is running in semi-production mode, I can't perform any UFS
> tests anymore. The system is running stable, and extensive
> checksumming after 8TB of data transfers didn't reveal any errors;
> neither did a filesystem stress test. Checksumming with 'cfv' runs at a
> sustained data rate of 290MB/s.

:)

Ollivier Robert

unread,
May 23, 2007, 5:04:43 AM5/23/07
to
According to Darren Reed:
> It's not RAM that ZFS really likes but your KVA (Kernel Virtual Address)
> space. With a 32bit kernel you are more likely to experience problems
> with KVA shortage than you are RAM shortage when using ZFS.

It was discussed a bit at the DevSummit at BSDCan, and an interesting
question was raised. ZFS has its own buffer cache and tries as best it
can to bypass our own, taking its toll on memory and KVA. I don't think we
could replace our buffer cache with the ARC due to the license, but it would
be nice to reduce the duplication there.
--
Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- rob...@keltia.freenix.fr
Darwin sidhe.keltia.net Kernel Version 8.9.1: Thu Feb 22 20:55:00 PST 2007 i386
