When will ZFS become stable?

Ivan Voras

unread,

Jan 4, 2008, 6:44:45 AM1/4/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enig2359F93E64CE8E64AD3DF4B9
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi,

As far as I know about the details of implementation and what would it
take to fix the problems, is it safe to assume ZFS will never become
stable during 7.x lifetime?

--------------enig2359F93E64CE8E64AD3DF4B9
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4-svn0 (GNU/Linux)

iD8DBQFHfhuqldnAQVacBcgRArdiAKDlGnUigVzolVxBSUQqMdGBr56AlwCg5xFs
41vO9M9O3zRxOlq7QrKKXzw=
=t+Xr
-----END PGP SIGNATURE-----

--------------enig2359F93E64CE8E64AD3DF4B9--

Brooks Davis

unread,

Jan 4, 2008, 11:35:21 AM1/4/08

to

--7AUc2qLy4jB3hD7Z
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 04, 2008 at 12:42:28PM +0100, Ivan Voras wrote:
> Hi,
>=20

> As far as I know about the details of implementation and what would it
> take to fix the problems, is it safe to assume ZFS will never become
> stable during 7.x lifetime?

I suppose that depends what you mean by stable. It seems stable enough
for a number of applications today. It's clearly not widely tested
since we haven't shipped a release based on it. It's possible some of
the issues of memory requirements won't be fixable in 7.x, but I don't
think that's a given.

-- Brooks

--7AUc2qLy4jB3hD7Z
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (FreeBSD)

iD8DBQFHfl/wXY6L6fI4GtQRApPDAKCGkc8LaMwoXoLwJNyY1raKCzGspgCff8J2
rlec38tZCAW9t3DN+iUbups=
=qcei
-----END PGP SIGNATURE-----

--7AUc2qLy4jB3hD7Z--

Ivan Voras

unread,

Jan 4, 2008, 12:59:06 PM1/4/08

to

On 04/01/2008, Brooks Davis <bro...@freebsd.org> wrote:
> On Fri, Jan 04, 2008 at 12:42:28PM +0100, Ivan Voras wrote:
> > Hi,
> >

> > As far as I know about the details of implementation and what would it
> > take to fix the problems, is it safe to assume ZFS will never become
> > stable during 7.x lifetime?
>
> I suppose that depends what you mean by stable.

My yardstick is currently "when a month goes by without anyone
complaining it crashed on him" :)

>It seems stable enough
> for a number of applications today.

This number is not so large. It seems to be easily crashed by rsync,
for example (speaking from my own experience, and also some of my
colleagues).

> It's possible some of
> the issues of memory requirements won't be fixable in 7.x, but I don't
> think that's a given.

I listened to some of Pawel's talks and devsummit brainstormings and I
get the feeling *none* of the problems can be fixed in 7.x, especially
on i386. I'm just asking for more official confirmation.

This is not a trivial question, since it involves deploying systems to
be maintained some years into the future - if ZFS will become stable
relatively shortly, it might be worth putting up with crashes, but if
not, there will be no near-future deployments of it.
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-curre...@freebsd.org"

Brooks Davis

unread,

Jan 4, 2008, 1:13:21 PM1/4/08

to

--XuV1QlJbYrcVoo+x

Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 04, 2008 at 06:58:32PM +0100, Ivan Voras wrote:
> On 04/01/2008, Brooks Davis <bro...@freebsd.org> wrote:
> > On Fri, Jan 04, 2008 at 12:42:28PM +0100, Ivan Voras wrote:
> > > Hi,
> > >
> > > As far as I know about the details of implementation and what would it
> > > take to fix the problems, is it safe to assume ZFS will never become
> > > stable during 7.x lifetime?
> >
> > I suppose that depends what you mean by stable.

>=20

> My yardstick is currently "when a month goes by without anyone
> complaining it crashed on him" :)

I'm not sure any file system we support meets that criteria...

> >It seems stable enough
> > for a number of applications today.

>=20

> This number is not so large. It seems to be easily crashed by rsync,
> for example (speaking from my own experience, and also some of my
> colleagues).

I saw those crashes early one, but that's 90% of what the mirror server
I'm running does and I'm not seeing them any more. I won't argue
everything is fixed, but ZFS seems much more stable than it was.

> > It's possible some of
> > the issues of memory requirements won't be fixable in 7.x, but I don't
> > think that's a given.

>=20

> I listened to some of Pawel's talks and devsummit brainstormings and I
> get the feeling *none* of the problems can be fixed in 7.x, especially
> on i386. I'm just asking for more official confirmation.

My understanding is that ZFS will never be a great choice on any 32-bit
architecture without major changes Sun probably isn't interested in
making. I think many of the problems people are reporting stem from
that.

> This is not a trivial question, since it involves deploying systems to
> be maintained some years into the future - if ZFS will become stable
> relatively shortly, it might be worth putting up with crashes, but if
> not, there will be no near-future deployments of it.

I don't think anyone is naive enough to say everything will be perfect
by any given date. Reality doesn't work that way. People looking to
deploy ZFS now will need to tolerate a certain amount of risk since it's
never been part of a FreeBSD release (and it's still quite new even in
Solaris). Issues being unfixable in 7.x are one of those risks, but
that's always the case.

-- Brooks

--XuV1QlJbYrcVoo+x
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (FreeBSD)

iD8DBQFHfnb4XY6L6fI4GtQRAtx1AKDiNM7zC3HA80cWMEU52oSZN4JJJQCgkRw6
8HAXT16t60UZ9Lc+hV051JE=
=jo9L
-----END PGP SIGNATURE-----

--XuV1QlJbYrcVoo+x--

Peter Schuller

unread,

Jan 6, 2008, 4:52:28 AM1/6/08

to

--nextPart13184023.iyG8TNZ5Up
Content-Type: text/plain;
charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

> This number is not so large. It seems to be easily crashed by rsync,
> for example (speaking from my own experience, and also some of my
> colleagues).

I can definitely say this is not *generally* true, as I do a lot of=20
rsyncing/rdiff-backup:ing and similar stuff (with many files / large files)=
=20
on ZFS without any stability issues. Problems for me have been limited to=20
32bit and the memory exhaustion issue rather than "hard" issues.

But perhaps that's all you are referring to.

=2D-=20
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.s...@infidyne.com>'
Key retrieval: Send an E-Mail to getp...@scode.org
E-Mail: peter.s...@infidyne.com Web: http://www.scode.org

--nextPart13184023.iyG8TNZ5Up
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: This is a digitally signed message part.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQBHgKSeDNor2+l1i30RAjynAKCl+ehI7a8/xoTBjc9Z5DTcR58obgCcDZF+
qvGLR/LxRjX47PfCCPo29r8=
=Pjj5
-----END PGP SIGNATURE-----

--nextPart13184023.iyG8TNZ5Up--

Ivan Voras

unread,

Jan 6, 2008, 7:59:33 AM1/6/08

to

On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
> > This number is not so large. It seems to be easily crashed by rsync,
> > for example (speaking from my own experience, and also some of my
> > colleagues).
>
> I can definitely say this is not *generally* true, as I do a lot of

> rsyncing/rdiff-backup:ing and similar stuff (with many files / large files)

> on ZFS without any stability issues. Problems for me have been limited to

> 32bit and the memory exhaustion issue rather than "hard" issues.

It's not generally true since kmem problems with rsync are often hard
to repeat - I have them on one machine, but not on another, similar
machine. This nonrepeatability is also a part of the problem.

> But perhaps that's all you are referring to.

Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
but only once.

Kris Kennaway

unread,

Jan 6, 2008, 8:08:08 AM1/6/08

to

Ivan Voras wrote:
> On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
>>> This number is not so large. It seems to be easily crashed by rsync,
>>> for example (speaking from my own experience, and also some of my
>>> colleagues).
>> I can definitely say this is not *generally* true, as I do a lot of
>> rsyncing/rdiff-backup:ing and similar stuff (with many files / large files)
>> on ZFS without any stability issues. Problems for me have been limited to
>> 32bit and the memory exhaustion issue rather than "hard" issues.
>
> It's not generally true since kmem problems with rsync are often hard
> to repeat - I have them on one machine, but not on another, similar
> machine. This nonrepeatability is also a part of the problem.
>
>> But perhaps that's all you are referring to.
>
> Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
> but only once.

kmem problems are just tuning. They are not indicative of stability
problems in ZFS. Please report any further non-kmem panics you experience.

Kris

Ivan Voras

unread,

Jan 6, 2008, 8:51:43 AM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enigCE1F2BBCDE2C97088BADBDD9
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Kris Kennaway wrote:

> kmem problems are just tuning. They are not indicative of stability=20
> problems in ZFS. =20

I disagree - anything that causes a panic is a stability problem. Panics =

persist AFTER the tunings (for i386 certainly, and there are unsolved=20
reports about it on amd64 also) and are present even when driving kmem=20
size to the maximum. The tunings *can not solve the problems* currently, =

they can only delay the time until they appear, which, by Murphy, often=20
means "sometime around midnight at Saturday". See also the possibility=20
of deadlocks in the ZIL, reported by some users.

> Please report any further non-kmem panics you experience.

I did, once to Pawel and once to the lists. Pawel couldn't help me and=20
nobody responded on the lists. Can you perform a MySQL read-write=20
benchmark on one of the 8-core machines with database on ZFS for about=20
an hour without pause? On a machine with 2 GB (or less) of RAM,=20
preferrably? I've seen problems on i386 but maybe they are also present=20
on amd64.

--------------enigCE1F2BBCDE2C97088BADBDD9

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgNy3ldnAQVacBcgRAqzHAKDfZWQg5+0b7chMA8z3yclmReYs6gCeJ3ir
GBVQIzcpbjgFk9JfyTrb1m0=
=1aOH
-----END PGP SIGNATURE-----

--------------enigCE1F2BBCDE2C97088BADBDD9--

Maciej Suszko

unread,

Jan 6, 2008, 9:14:12 AM1/6/08

to

Kris Kennaway wrote:
> Ivan Voras wrote:
> > On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
> >>> This number is not so large. It seems to be easily crashed by
> >>> rsync, for example (speaking from my own experience, and also
> >>> some of my colleagues).
> >> I can definitely say this is not *generally* true, as I do a lot of
> >> rsyncing/rdiff-backup:ing and similar stuff (with many files /
> >> large files) on ZFS without any stability issues. Problems for me
> >> have been limited to 32bit and the memory exhaustion issue rather
> >> than "hard" issues.
> >
> > It's not generally true since kmem problems with rsync are often
> > hard to repeat - I have them on one machine, but not on another,
> > similar machine. This nonrepeatability is also a part of the
> > problem.
> >
> >> But perhaps that's all you are referring to.
> >
> > Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
> > but only once.
>

> kmem problems are just tuning. They are not indicative of stability

> problems in ZFS. Please report any further non-kmem panics you
> experience.

I agree that ZFS is pretty stable itself. I use 32bit machine with
2gigs od RAM and all hang cases are kmem related, but the fact is that
I haven't found any way of tuning to stop it crashing. When I do some
rsyncing, especially beetwen different pools - it hangs or reboots -
mostly on bigger files (i.e. rsyncing ports tree with distfiles).
At the moment I patched the kernel with vm_kern.c.2.patch and it just
stopped crashing, but from time to time the machine looks like beeing
freezed for a second or two, after that it works normally.
Have you got any similar experience?
--
regards, Maciej Suszko.

Kris Kennaway

unread,

Jan 6, 2008, 9:28:31 AM1/6/08

to

Ivan Voras wrote:

> Kris Kennaway wrote:
>
>> kmem problems are just tuning. They are not indicative of stability
>> problems in ZFS.
>

> I disagree - anything that causes a panic is a stability problem. Panics

> persist AFTER the tunings (for i386 certainly, and there are unsolved

> reports about it on amd64 also) and are present even when driving kmem

> size to the maximum. The tunings *can not solve the problems* currently,

> they can only delay the time until they appear, which, by Murphy, often

> means "sometime around midnight at Saturday".

That's an assertion directly contradicted by my experience running a
heavily loaded 8-core i386 package builder. Please explain in detail
the steps you have taken to tune your kernel. Do you have the vm_kern.c
patch applied?

> See also the possibility

> of deadlocks in the ZIL, reported by some users.

Yes, this is an outstanding issue. There are a couple of others I run
into in the above configuration, but kmem panics aren't among them.

>> Please report any further non-kmem panics you experience.
>

> I did, once to Pawel and once to the lists. Pawel couldn't help me and

> nobody responded on the lists. Can you perform a MySQL read-write

> benchmark on one of the 8-core machines with database on ZFS for about

> an hour without pause? On a machine with 2 GB (or less) of RAM,

> preferrably? I've seen problems on i386 but maybe they are also present

> on amd64.

I am not set up to test this right now.

Kris

Ivan Voras

unread,

Jan 6, 2008, 9:52:23 AM1/6/08

to

On 06/01/2008, Kris Kennaway <kr...@freebsd.org> wrote:

> That's an assertion directly contradicted by my experience running a
> heavily loaded 8-core i386 package builder.

What is the IO profile of this usage? I'd guess that it's "short
bursts of high activity (archive extraction, installing) followed by
long periods of low activity (compiling)". From what I see on the
lists and somewhat from my own experience, the problem appears more
often when the load is more like "constant high r+w activity",
probably with several users (applications) doing the activity in
parallel.

> Please explain in detail
> the steps you have taken to tune your kernel.

vm.kmem_size="512M"
vm.kmem_size_max="512M"

This should be enough for a 2 GB machine that does other things.

> Do you have the vm_kern.c
> patch applied?

I can confirm that while it delays the panics, it doesn't eliminate
them (this also seems to be the conclusion of several users that have
tested it shortly after it's been posted). The fact that it's not
committed is good enough indication that it's not The Answer.

(And besides, asking users to apply non-committed patches just to run
their systems normally is bad practice :) I can just imagine the
Release Notes: "if you're using ZFS, you'll have to manually patch the
kernel with this patch:..." :)

This close to the -RELEASE, I judge the chances of it being committed are low).

Kris Kennaway

unread,

Jan 6, 2008, 10:10:07 AM1/6/08

to

Ivan Voras wrote:
> On 06/01/2008, Kris Kennaway <kr...@freebsd.org> wrote:
>
>> That's an assertion directly contradicted by my experience running a
>> heavily loaded 8-core i386 package builder.
>
> What is the IO profile of this usage? I'd guess that it's "short
> bursts of high activity (archive extraction, installing) followed by
> long periods of low activity (compiling)". From what I see on the
> lists and somewhat from my own experience, the problem appears more
> often when the load is more like "constant high r+w activity",
> probably with several users (applications) doing the activity in
> parallel.

This is a high I/O environment including lots of parallel activity.

>> Please explain in detail
>> the steps you have taken to tune your kernel.
>
> vm.kmem_size="512M"
> vm.kmem_size_max="512M"
>
> This should be enough for a 2 GB machine that does other things.

No, clearly it is not enough (and you claimed previously to have done
more tuning than this). I have it set to 600MB on the i386 system with
a 1.5GB KVA. Both were necessary.

>> Do you have the vm_kern.c
>> patch applied?
>
> I can confirm that while it delays the panics, it doesn't eliminate
> them (this also seems to be the conclusion of several users that have
> tested it shortly after it's been posted). The fact that it's not
> committed is good enough indication that it's not The Answer.

It is planned to be committed. Pawel has been away for a while.

> (And besides, asking users to apply non-committed patches just to run
> their systems normally is bad practice :) I can just imagine the
> Release Notes: "if you're using ZFS, you'll have to manually patch the
> kernel with this patch:..." :)

ZFS already tells you up front that it's experimental code and likely to
have problems. Users of 7.0-RELEASE should not have unrealistic
expectations.

> This close to the -RELEASE, I judge the chances of it being committed are low).

Perhaps, but that only applies to 7.0-RELEASE.

Kris

Kris Kennaway

unread,

Jan 6, 2008, 10:47:06 AM1/6/08

to

Maciej Suszko wrote:

> Kris Kennaway wrote:
>> Ivan Voras wrote:

>>> On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
>>>>> This number is not so large. It seems to be easily crashed by
>>>>> rsync, for example (speaking from my own experience, and also
>>>>> some of my colleagues).
>>>> I can definitely say this is not *generally* true, as I do a lot of
>>>> rsyncing/rdiff-backup:ing and similar stuff (with many files /
>>>> large files) on ZFS without any stability issues. Problems for me
>>>> have been limited to 32bit and the memory exhaustion issue rather
>>>> than "hard" issues.
>>> It's not generally true since kmem problems with rsync are often
>>> hard to repeat - I have them on one machine, but not on another,
>>> similar machine. This nonrepeatability is also a part of the
>>> problem.
>>>
>>>> But perhaps that's all you are referring to.
>>> Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
>>> but only once.

>> kmem problems are just tuning. They are not indicative of stability

>> problems in ZFS. Please report any further non-kmem panics you
>> experience.
>

> I agree that ZFS is pretty stable itself. I use 32bit machine with
> 2gigs od RAM and all hang cases are kmem related, but the fact is that
> I haven't found any way of tuning to stop it crashing. When I do some
> rsyncing, especially beetwen different pools - it hangs or reboots -
> mostly on bigger files (i.e. rsyncing ports tree with distfiles).
> At the moment I patched the kernel with vm_kern.c.2.patch and it just
> stopped crashing, but from time to time the machine looks like beeing
> freezed for a second or two, after that it works normally.
> Have you got any similar experience?

That is expected. That patch makes the system do more work to try and
reclaim memory when it would previously have panicked from lack of
memory. However, the same advice applies as to Ivan: you should try and
tune the memory parameters better to avoid this last-ditch sitation.

Kris

P.S. It sounds like you do not have sufficient debugging configured
either: crashes should produce either a DDB prompt or a coredump so they
can be studied and understood.

Henri Hennebert

unread,

Jan 6, 2008, 10:49:32 AM1/6/08

to

Kris Kennaway wrote:
> Ivan Voras wrote:
>> On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
>>>> This number is not so large. It seems to be easily crashed by rsync,
>>>> for example (speaking from my own experience, and also some of my
>>>> colleagues).
>>> I can definitely say this is not *generally* true, as I do a lot of
>>> rsyncing/rdiff-backup:ing and similar stuff (with many files / large
>>> files)
>>> on ZFS without any stability issues. Problems for me have been
>>> limited to
>>> 32bit and the memory exhaustion issue rather than "hard" issues.
>>
>> It's not generally true since kmem problems with rsync are often hard
>> to repeat - I have them on one machine, but not on another, similar
>> machine. This nonrepeatability is also a part of the problem.
>>
>>> But perhaps that's all you are referring to.
>>
>> Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
>> but only once.
>
> kmem problems are just tuning. They are not indicative of stability
> problems in ZFS. Please report any further non-kmem panics you experience.

I encounter 2 times a deadlock during high I/O activity (the last one
during rsync + rm -r on a 5GB hierarchy (openoffice-2/work).

I was running with this patch:
http://people.freebsd.org/~pjd/patches/zgd_done.patch
db> show allpcpu
Current CPU: 1

cpuid = 0
curthread = 0xa5ebe440: pid 3422 "txg_thread_enter"
curpcb = 0xeb175d90
fpcurthread = none
idlethread = 0xa5529aa0: pid 12 "idle: cpu0"
APIC ID = 0
currentldt = 0x50

cpuid = 1
curthread = 0xa56ab220: pid 47 "arc_reclaim_thread"
curpcb = 0xe6837d90
fpcurthread = none
idlethread = 0xa5529880: pid 11 "idle: cpu1"
APIC ID = 1
currentldt = 0x50

With the 2 times arc_reclaim_thread `running`

>
> Kris

Kris Kennaway

unread,

Jan 6, 2008, 11:04:56 AM1/6/08

to

Backtraces of the affected processes (or just alltrace) are usually
required to proceed with debugging, and lock status is also often vital
(show alllocks, requires witness). Also, in the case when threads are
actually running (not deadlocked), then it is often useful to repeatedly
break/continue and sample many backtraces to try and determine where the
threads are looping.

Maciej Suszko

unread,

Jan 6, 2008, 11:07:00 AM1/6/08

to

Kris Kennaway wrote:

> Maciej Suszko wrote:
> > Kris Kennaway wrote:
> >> Ivan Voras wrote:
> >>> On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
> >>>>> This number is not so large. It seems to be easily crashed by
> >>>>> rsync, for example (speaking from my own experience, and also
> >>>>> some of my colleagues).
> >>>> I can definitely say this is not *generally* true, as I do a lot
> >>>> of rsyncing/rdiff-backup:ing and similar stuff (with many files /
> >>>> large files) on ZFS without any stability issues. Problems for me
> >>>> have been limited to 32bit and the memory exhaustion issue rather
> >>>> than "hard" issues.
> >>> It's not generally true since kmem problems with rsync are often
> >>> hard to repeat - I have them on one machine, but not on another,
> >>> similar machine. This nonrepeatability is also a part of the
> >>> problem.
> >>>
> >>>> But perhaps that's all you are referring to.
> >>> Mostly. I did have a ZFS crash with rsync that wasn't kmem
> >>> related, but only once.
> >> kmem problems are just tuning. They are not indicative of
> >> stability problems in ZFS. Please report any further non-kmem
> >> panics you experience.
> >

> > I agree that ZFS is pretty stable itself. I use 32bit machine with
> > 2gigs od RAM and all hang cases are kmem related, but the fact is
> > that I haven't found any way of tuning to stop it crashing. When I
> > do some rsyncing, especially beetwen different pools - it hangs or
> > reboots - mostly on bigger files (i.e. rsyncing ports tree with
> > distfiles). At the moment I patched the kernel with
> > vm_kern.c.2.patch and it just stopped crashing, but from time to
> > time the machine looks like beeing freezed for a second or two,
> > after that it works normally. Have you got any similar experience?
>
> That is expected. That patch makes the system do more work to try
> and reclaim memory when it would previously have panicked from lack
> of memory. However, the same advice applies as to Ivan: you should
> try and tune the memory parameters better to avoid this last-ditch
> sitation.

As Ivan said - tuning kmem_size only delay the moment system crash,
earlier or after it happens - that's my point of view.

> P.S. It sounds like you do not have sufficient debugging configured
> either: crashes should produce either a DDB prompt or a coredump so
> they can be studied and understood.

You're right - I turned debugging off, because it's not a production
machine and I can afford such behaviour. Right now, using kernel with
kmem patch applied it's ,,usable''.
--
regards, Maciej Suszko.

Ivan Voras

unread,

Jan 6, 2008, 11:46:11 AM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enigEDC04B50439687A9573D2846

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Kris Kennaway wrote:

> No, clearly it is not enough=20

This looks like we're constantly chasing the "right amount". Does it=20
depend so much on CPU and IO speed that there's never a generally=20
sufficient "right amount"? So when CPU and drive speed increase, the new =

amount will always be some bigger value?

>(and you claimed previously to have done=20
> more tuning than this).=20

Where? What else is there except kmem tuning (including KVA_PAGES)? IIRC =

Pawel said all other suggested tunings don't do much.

> I have it set to 600MB on the i386 system with=20

> a 1.5GB KVA. Both were necessary.

My point is that the fact that such things are necessary (1.5 GB KVA os=20
a lot on i386) mean that there are serious problems which aren't getting =

fixed since ZFS was imported (that's over 6 months ago).

I see you've added to http://wiki.freebsd.org/ZFSTuningGuide; can you=20
please add the values that work for you to it (especially for KVA_PAGES=20
since the exact kernel configuration line is never spelled out in the=20
document; and say for which hardware are the values known to work)?

> ZFS already tells you up front that it's experimental code and likely t=
o=20
> have problems. =20

I know it's experimental, but requiring users to perform so much tuning=20
just to get it work without crashing will mean it will get a bad=20
reputation early on. Do you (or anyone) know what are the reasons for=20
not having vm.kmem_size to 512 MB by default? Better yet, why not=20
increase both vm.kmem_size and KVA_PAGES to (the equivalent of) 640 MB=20
or 768 MB by default for 7.0?

>Users of 7.0-RELEASE should not have unrealistic
> expectations.

As I've said at the first post of this thread: I'm interested in if it's =

ever going to be stable for 7.x.

--------------enigEDC04B50439687A9573D2846

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgQWNldnAQVacBcgRAhigAKDK9bFXJy5Y6nLyyk7Xb98iA57cwQCgrTRz
as4xA3tKTpL2jXpYtGIKDuI=
=dzLD
-----END PGP SIGNATURE-----

--------------enigEDC04B50439687A9573D2846--

Henri Hennebert

unread,

Jan 6, 2008, 11:48:22 AM1/6/08

to

Kris Kennaway wrote:

> Henri Hennebert wrote:
>> Kris Kennaway wrote:
>>> Ivan Voras wrote:
>>>> On 06/01/2008, Peter Schuller <peter.s...@infidyne.com> wrote:
>>>>>> This number is not so large. It seems to be easily crashed by rsync,
>>>>>> for example (speaking from my own experience, and also some of my
>>>>>> colleagues).
>>>>> I can definitely say this is not *generally* true, as I do a lot of
>>>>> rsyncing/rdiff-backup:ing and similar stuff (with many files /
>>>>> large files)
>>>>> on ZFS without any stability issues. Problems for me have been
>>>>> limited to
>>>>> 32bit and the memory exhaustion issue rather than "hard" issues.
>>>>
>>>> It's not generally true since kmem problems with rsync are often hard
>>>> to repeat - I have them on one machine, but not on another, similar
>>>> machine. This nonrepeatability is also a part of the problem.
>>>>
>>>>> But perhaps that's all you are referring to.
>>>>
>>>> Mostly. I did have a ZFS crash with rsync that wasn't kmem related,
>>>> but only once.
>>>
>>> kmem problems are just tuning. They are not indicative of stability
>>> problems in ZFS. Please report any further non-kmem panics you
>>> experience.
>>

>> I encounter 2 times a deadlock during high I/O activity (the last one
>> during rsync + rm -r on a 5GB hierarchy (openoffice-2/work).
>>
>> I was running with this patch:
>> http://people.freebsd.org/~pjd/patches/zgd_done.patch
>> db> show allpcpu
>> Current CPU: 1
>>
>> cpuid = 0
>> curthread = 0xa5ebe440: pid 3422 "txg_thread_enter"
>> curpcb = 0xeb175d90
>> fpcurthread = none
>> idlethread = 0xa5529aa0: pid 12 "idle: cpu0"
>> APIC ID = 0
>> currentldt = 0x50
>>
>> cpuid = 1
>> curthread = 0xa56ab220: pid 47 "arc_reclaim_thread"
>> curpcb = 0xe6837d90
>> fpcurthread = none
>> idlethread = 0xa5529880: pid 11 "idle: cpu1"
>> APIC ID = 1
>> currentldt = 0x50
>>
>> With the 2 times arc_reclaim_thread `running`
>
> Backtraces of the affected processes (or just alltrace) are usually

noted for next time

> required to proceed with debugging, and lock status is also often vital
> (show alllocks, requires witness).

I add it to my kernel config

Also, in the case when threads are
> actually running (not deadlocked), then it is often useful to repeatedly
> break/continue and sample many backtraces to try and determine where the
> threads are looping.

I do this after the second deadlock and arc_reclaim_thread was always
there and second cpu was idle.

Henri
>
> Kris

Ivan Voras

unread,

Jan 6, 2008, 11:51:13 AM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enig4083928B2011A1784CF8EBD9

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Robert Watson wrote:

> I'm not sure if anyone has mentioned this yet in the thread, but anothe=
r=20
> thing worth taking into account in considering the stability of ZFS is =

> whether or not Sun considers it a production feature in Solaris. Last =
I=20
> heard, it was still considered an experimental feature there as well.

Last I heard, rsync didn't crash Solaris on ZFS :)

--------------enig4083928B2011A1784CF8EBD9

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgQY4ldnAQVacBcgRAgPaAJ9VdX3dBVBaTUe6sHPZp8BC7fGUrwCfQTy7
aQdjHiLPe1K5WOwdh67nJ4I=
=Z7cc
-----END PGP SIGNATURE-----

--------------enig4083928B2011A1784CF8EBD9--

Robert Watson

unread,

Jan 6, 2008, 12:09:54 PM1/6/08

to

On Sun, 6 Jan 2008, Ivan Voras wrote:

> Robert Watson wrote:
>
>> I'm not sure if anyone has mentioned this yet in the thread, but another

>> thing worth taking into account in considering the stability of ZFS is

>> whether or not Sun considers it a production feature in Solaris. Last I

>> heard, it was still considered an experimental feature there as well.
>
> Last I heard, rsync didn't crash Solaris on ZFS :)

My admittedly second-hand understanding is that ZFS shows similarly gratuitous
memory use on both Mac OS X and Solaris. One advantage Solaris has is that it
runs primarily on expensive 64-bit servers with lots of memory. Part of the
problem on FreeBSD is that people run ZFS on sytems with 32-bit CPUs and a lot
less memory. It could be that ZFS should be enforcing higher minimum hardware
requirements to mount (i.e., refusing to run on systems with 32-bit address
spaces or <4gb of memory and inadequate tuning).

Robert N M Watson
Computer Laboratory
University of Cambridge

Kris Kennaway

unread,

Jan 6, 2008, 12:12:49 PM1/6/08

to

Ivan Voras wrote:
> Kris Kennaway wrote:
>
>> No, clearly it is not enough
>

> This looks like we're constantly chasing the "right amount". Does it

> depend so much on CPU and IO speed that there's never a generally

> sufficient "right amount"? So when CPU and drive speed increase, the new

> amount will always be some bigger value?

It depends on your workload, which in turn depends on your hardware.
The harder you can drive ZFS the more memory it will require.

>> (and you claimed previously to have done more tuning than this).
>

> Where? What else is there except kmem tuning (including KVA_PAGES)? IIRC

> Pawel said all other suggested tunings don't do much.

Tuning is an interactive process. If 512MB is not enough kmem_map, then
increase it. Repeat as necessary.

>> I have it set to 600MB on the i386 system with a 1.5GB KVA. Both were

>> necessary.
>
> My point is that the fact that such things are necessary (1.5 GB KVA os

> a lot on i386) mean that there are serious problems which aren't getting

> fixed since ZFS was imported (that's over 6 months ago).

ZFS is a memory hog. There is nothing that can really be done about
this, and it is just not a good fit on i386 because of limitations of
the hardware architecture. Note that Sun does not recommend using ZFS
on a 32-bit system either, for the same reasons. It is unlikely this
can really be fixed, although mitigation strategies like the vm_kern.c
patch are possible.

> I see you've added to http://wiki.freebsd.org/ZFSTuningGuide; can you

> please add the values that work for you to it (especially for KVA_PAGES

> since the exact kernel configuration line is never spelled out in the

> document; and say for which hardware are the values known to work)?

OK.

>> ZFS already tells you up front that it's experimental code and likely

>> to have problems.

>
> I know it's experimental, but requiring users to perform so much tuning

> just to get it work without crashing will mean it will get a bad

> reputation early on. Do you (or anyone) know what are the reasons for

> not having vm.kmem_size to 512 MB by default?

Increasing vm.kmem_size.max to 512MB by default has other implications,
but it is something that should be considered.

> Better yet, why not

> increase both vm.kmem_size and KVA_PAGES to (the equivalent of) 640 MB

> or 768 MB by default for 7.0?

That is answered in the tuning guide. Tuning KVA_PAGES by default is
not appropriate.

> >Users of 7.0-RELEASE should not have unrealistic
> > expectations.
>
> As I've said at the first post of this thread: I'm interested in if it's

> ever going to be stable for 7.x.

This was in reply to a comment you made about the vm_kern.c patch
affecting users of 7.0-RELEASE.

Anyway, to sum up, ZFS has known bugs, some of which are unresolved by
the authors, and it is difficult to make it work on i386. It is likely
that the bugs will be fixed over time (obviously), but amd64 will always
be a better choice than i386 for using ZFS because you will not be
continually bumping up against the hardware limitations.

Kris

Kris Kennaway

unread,

Jan 6, 2008, 12:15:05 PM1/6/08

to

To repeat, it is important not just to note which thread is running, but
*what the thread is doing*. This means repeatedly comparing the
backtraces, which will allow you to build up a picture of which part of
the code it is looping in.

Ivan Voras

unread,

Jan 6, 2008, 12:22:30 PM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enigBF0E8DBB5E5B643C0C0784B4

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Kris Kennaway wrote:

> > Better yet, why not
>> increase both vm.kmem_size and KVA_PAGES to (the equivalent of) 640 MB=
=20

>> or 768 MB by default for 7.0?

>=20
> That is answered in the tuning guide. Tuning KVA_PAGES by default is=20
> not appropriate.

Ok. I'd like to understand what is the relationship between KVA_PAGES=20
and vm.kmem_size. The tuning guide says:

"""By default the kernel receives 1GB of the 4GB of address space=20
available on the i386 architecture, and this is used for all of the=20
kernel address space needs, not just the kmem map. By increasing=20
KVA_PAGES you can allocate a larger proportion of the 4GB address=20
space..."""

and:

"""recompile your kernel with increased KVA_PAGES option, to increase=20
the size of the kernel address space, before vm.kmem_size can be=20
increased beyond 512M"""

What is the other 512 MB of the 1 GB used for?

--------------enigBF0E8DBB5E5B643C0C0784B4

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgQ36ldnAQVacBcgRArjoAJ4+6LvxL9bOjNOIHLiCA5rs2o99hACfZlBl
mQUKsJ1MR9ZyUXbn5EoYyKE=
=d0JH
-----END PGP SIGNATURE-----

--------------enigBF0E8DBB5E5B643C0C0784B4--

Ivan Voras

unread,

Jan 6, 2008, 12:29:54 PM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enig601868FFD7E993E6262A8244

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Robert Watson wrote:
> On Sun, 6 Jan 2008, Ivan Voras wrote:

>> Last I heard, rsync didn't crash Solaris on ZFS :)

>=20
> My admittedly second-hand understanding is that ZFS shows similarly=20
> gratuitous memory use on both Mac OS X and Solaris. One advantage=20
> Solaris has is that it runs primarily on expensive 64-bit servers with =

> lots of memory. Part of the problem on FreeBSD is that people run ZFS =

> on sytems with 32-bit CPUs and a lot less memory. It could be that ZFS=
=20
> should be enforcing higher minimum hardware requirements to mount (i.e.=
,=20
> refusing to run on systems with 32-bit address spaces or <4gb of memory=
=20
> and inadequate tuning).

Solaris nowadays refuses to install on anything without at least 1 GB of =

memory. I'm all for ZFS refusing to run on inadequatly tuned hardware,=20
but apparently there's no algorithmic way to say what *is* adequately=20
tuned, except for "try X and if it crashes, try Y, repeat as necessary".

The reason why I'm arguing this topic is that it isn't a matter of=20
tuning like "it will run slowly if you don't tune it" - it's more like=20
"it won't run at all if you don't go through the laborious=20
trial-and-error process of tuning it, including patching your kernel and =

running a non-GENERIC configuration".

--------------enig601868FFD7E993E6262A8244

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgQ+/ldnAQVacBcgRAtnAAKCtMumii9wevIzasHr8NZ6x5aGQ2ACcDmBf
Nqzn2r1T/d1ngFr8i4tyZHU=
=PAma
-----END PGP SIGNATURE-----

--------------enig601868FFD7E993E6262A8244--

Kris Kennaway

unread,

Jan 6, 2008, 12:36:29 PM1/6/08

to

Ivan Voras wrote:

> What is the other 512 MB of the 1 GB used for?

Everything else that the kernel needs address space for. Buffer cache,
mbuf allocation, etc.

Ivan Voras

unread,

Jan 6, 2008, 12:37:20 PM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enigCF28DF68F34EF7BB67EC6EA2

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Kris Kennaway wrote:

> Ivan Voras wrote:
>> Robert Watson wrote:
>>

>>> I'm not sure if anyone has mentioned this yet in the thread, but=20
>>> another thing worth taking into account in considering the stability =

>>> of ZFS is whether or not Sun considers it a production feature in=20
>>> Solaris. Last I heard, it was still considered an experimental=20
>>> feature there as well.

>>
>> Last I heard, rsync didn't crash Solaris on ZFS :)
>=20

> [Citation needed]

I can't provide citation about a thing that doesn't happen - you don't=20
hear things like "oh and yesterday I ran rsync on my Solaris with ZFS=20
and *it didn't crash*!" often.

But, with some grains of salt taken, consider this Google results:

* searching for "rsync crash solaris zfs": 790 results, most of them=20
obviously irrelevant
* searching for "rsync crash freebsd zfs": 10,800 results; a small=20
number of the results is from this thread, some are duplicates, but it's =

a large number in any case.

I feel that the number of Solaris+ZFS installations worldwide is larger=20
than that of FreeBSD+ZFS and they've had ZFS longer.

--------------enigCF28DF68F34EF7BB67EC6EA2

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgRGYldnAQVacBcgRAl22AKCNn8JJNdP7fYk3PNsnRhSGdwEn7gCfYUHc
clThYOP6zkJD3wFZrFqFHXc=
=Et/1
-----END PGP SIGNATURE-----

--------------enigCF28DF68F34EF7BB67EC6EA2--

Kris Kennaway

unread,

Jan 6, 2008, 12:44:40 PM1/6/08

to

Ivan Voras wrote:
> Robert Watson wrote:

>> On Sun, 6 Jan 2008, Ivan Voras wrote:
>

>>> Last I heard, rsync didn't crash Solaris on ZFS :)
>>

>> My admittedly second-hand understanding is that ZFS shows similarly

>> gratuitous memory use on both Mac OS X and Solaris. One advantage

>> Solaris has is that it runs primarily on expensive 64-bit servers with

>> lots of memory. Part of the problem on FreeBSD is that people run ZFS

>> on sytems with 32-bit CPUs and a lot less memory. It could be that

>> ZFS should be enforcing higher minimum hardware requirements to mount
>> (i.e., refusing to run on systems with 32-bit address spaces or <4gb
>> of memory and inadequate tuning).

>
> Solaris nowadays refuses to install on anything without at least 1 GB of

> memory. I'm all for ZFS refusing to run on inadequatly tuned hardware,

> but apparently there's no algorithmic way to say what *is* adequately

> tuned, except for "try X and if it crashes, try Y, repeat as necessary".

What you appear to be still missing is that ZFS also causes memory
exhaustion panics when run on 32-bit Solaris. In fact (unless they have
since fixed it), the opensolaris ZFS code makes *absolutely no attempt*
to accomodate i386 memory limitations at all.

Ivan Voras

unread,

Jan 6, 2008, 1:01:26 PM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enig0A3831A65593ADD3330D3F5D

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Kris Kennaway wrote:

> What you appear to be still missing is that ZFS also causes memory=20
> exhaustion panics when run on 32-bit Solaris.=20

Citation needed. I'm interested.

--------------enig0A3831A65593ADD3330D3F5D

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgRcqldnAQVacBcgRAqiCAKDYA86boV1pvgRfvuZZtppnVuxWyACgq30c
0ZqL8GfJSoOGiutpQHA1S+U=
=ZNrd
-----END PGP SIGNATURE-----

--------------enig0A3831A65593ADD3330D3F5D--

Kris Kennaway

unread,

Jan 6, 2008, 1:12:54 PM1/6/08

to

Ivan Voras wrote:
> Kris Kennaway wrote:

>> Ivan Voras wrote:
>>> Robert Watson wrote:
>>>

>>>> I'm not sure if anyone has mentioned this yet in the thread, but

>>>> another thing worth taking into account in considering the stability

>>>> of ZFS is whether or not Sun considers it a production feature in

>>>> Solaris. Last I heard, it was still considered an experimental
>>>> feature there as well.
>>>

>>> Last I heard, rsync didn't crash Solaris on ZFS :)
>>

>> [Citation needed]
>
> I can't provide citation about a thing that doesn't happen - you don't

> hear things like "oh and yesterday I ran rsync on my Solaris with ZFS

> and *it didn't crash*!" often.
>
> But, with some grains of salt taken, consider this Google results:
>
> * searching for "rsync crash solaris zfs": 790 results, most of them

> obviously irrelevant
> * searching for "rsync crash freebsd zfs": 10,800 results; a small

> number of the results is from this thread, some are duplicates, but it's

> a large number in any case.
>
> I feel that the number of Solaris+ZFS installations worldwide is larger

> than that of FreeBSD+ZFS and they've had ZFS longer.

Almost all Solaris systems are 64 bit.

Patrick Hajek

unread,

Jan 6, 2008, 1:31:52 PM1/6/08

to

> I'm not sure if anyone has mentioned this yet in the thread, but
> another thing worth taking into account in considering the stability of ZFS is
> whether or not Sun considers it a production feature in Solaris. Last I heard,
> it was still considered an experimental feature there as well.

It was not production ready for solaris until they included it the
quarterly release of Solaris 10-- appox a year ago.

-Patrick

--
Patrick Hajek /"\
DOE Joint Genome Institute \ / ASCII RIBBON CAMPAIGN
Desk: 925.296.5762 X HELP CURE HTML MAIL
Cell: 925.997.4826 / \

PGP Fingerprint 688E B579 3449 28B1 DB14 E15A 76BB C1CA A745 9DAB

Ivan Voras

unread,

Jan 6, 2008, 1:50:40 PM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enig3EB635B71F4C09B8E6DDE3BC

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Kris Kennaway wrote:

> Ivan Voras wrote:
>> Kris Kennaway wrote:
>>
>>> What you appear to be still missing is that ZFS also causes memory=20
>>> exhaustion panics when run on 32-bit Solaris.=20
>>
>> Citation needed. I'm interested.

>=20
> Reports on the zfs-discuss mailing list.

Thanks for the pointer. I'm looking at the archives.

So far I've found this:=20
http://www.archivum.info/zfs-d...@opensolaris.org/2007-07/msg00016.htm=
l=20
which doesn't mention panics;

and this:=20
http://www.archivum.info/zfs-d...@opensolaris.org/2007-07/msg00054.htm=
l=20
which didn't get any replies but the backtrace doesn't include anything=20
resembling a malloc-like call.

--------------enig3EB635B71F4C09B8E6DDE3BC

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgSKzldnAQVacBcgRAkGmAKCDYOMbWXcSSaJtuWo7YS8MjSNlcwCePKAq
9q/fY9HjXfbs5Q7jXAgBjuM=
=W2hT
-----END PGP SIGNATURE-----

--------------enig3EB635B71F4C09B8E6DDE3BC--

Gary Corcoran

unread,

Jan 6, 2008, 2:08:36 PM1/6/08

to

Kris Kennaway wrote:
> Ivan Voras wrote:
>> Kris Kennaway wrote:
>>

>>> No, clearly it is not enough
>>
>> This looks like we're constantly chasing the "right amount". Does it
>> depend so much on CPU and IO speed that there's never a generally
>> sufficient "right amount"? So when CPU and drive speed increase, the
>> new amount will always be some bigger value?
>
> It depends on your workload, which in turn depends on your hardware. The
> harder you can drive ZFS the more memory it will require.

As a user, I would expect the above to mean "to continue running quickly".
If it has to slow to a crawl for a moment, due to inadequate memory in
your system, then that's just tough cookies. But crashing (panicing)
is not really acceptable for most people (maybe except a developer).
Again from a user perspective, if ZFS needs "tuning" to run at full speed,
or even at all, I would expect *it* to be able to do a few simple calculations
and do the tuning itself! :-) (even if, in worst case, it requires a clean
shutdown and reboot for the new values to take effect)

The above is not meant as a criticism of the current explicitly-labeled
"experimental" code. Rather, it is what I would hope we might be able
to see sometime over the next year...

>>> (and you claimed previously to have done more tuning than this).
>>
>> Where? What else is there except kmem tuning (including KVA_PAGES)?
>> IIRC Pawel said all other suggested tunings don't do much.
>
> Tuning is an interactive process. If 512MB is not enough kmem_map, then
> increase it. Repeat as necessary.
>
>>> I have it set to 600MB on the i386 system with a 1.5GB KVA. Both
>>> were necessary.
>>
>> My point is that the fact that such things are necessary (1.5 GB KVA
>> os a lot on i386) mean that there are serious problems which aren't
>> getting fixed since ZFS was imported (that's over 6 months ago).
>
> ZFS is a memory hog. There is nothing that can really be done about
> this, and it is just not a good fit on i386 because of limitations of
> the hardware architecture. Note that Sun does not recommend using ZFS
> on a 32-bit system either, for the same reasons. It is unlikely this
> can really be fixed, although mitigation strategies like the vm_kern.c
> patch are possible.

Perhaps the 7.0 release notes should include a note to the effect that
ZFS is *strongly* NOT RECOMMENDED on 32-bit systems at this time, due
to the likelihood of panics. I say this because it sure sounds like
"out of the box" that is what you're most likely to end up with, and
even with manual "corrections" you may still have panics. So why not
just be upfront about it and tell people that, at least at this time,
ZFS is only recommended for 64-bit systems, with a minimum of "N" (2?)
GB of memory? If you were already planning something like this for
the release notes, my apologies.

>> I see you've added to http://wiki.freebsd.org/ZFSTuningGuide; can you
>> please add the values that work for you to it (especially for
>> KVA_PAGES since the exact kernel configuration line is never spelled
>> out in the document; and say for which hardware are the values known
>> to work)?
>
> OK.
>
>>> ZFS already tells you up front that it's experimental code and likely
>>> to have problems.
>>
>> I know it's experimental, but requiring users to perform so much
>> tuning just to get it work without crashing will mean it will get a
>> bad reputation early on. Do you (or anyone) know what are the reasons
>> for not having vm.kmem_size to 512 MB by default?
>
> Increasing vm.kmem_size.max to 512MB by default has other implications,
> but it is something that should be considered.
>

> > Better yet, why not
>> increase both vm.kmem_size and KVA_PAGES to (the equivalent of) 640 MB

>> or 768 MB by default for 7.0?
>

> That is answered in the tuning guide. Tuning KVA_PAGES by default is
> not appropriate.
>
>> >Users of 7.0-RELEASE should not have unrealistic
>> > expectations.
>>
>> As I've said at the first post of this thread: I'm interested in if
>> it's ever going to be stable for 7.x.
>
> This was in reply to a comment you made about the vm_kern.c patch
> affecting users of 7.0-RELEASE.
>
> Anyway, to sum up, ZFS has known bugs, some of which are unresolved by
> the authors, and it is difficult to make it work on i386. It is likely
> that the bugs will be fixed over time (obviously), but amd64 will always
> be a better choice than i386 for using ZFS because you will not be
> continually bumping up against the hardware limitations.

BTW, I am a happy user of ZFS on a 2GB Core2Duo 64-bit system. I never
did any "tuning", it "just worked" for my light-duty file serving needs.
This was from the (I believe) May 2007 snapshot.

Gary

Maciej Suszko

unread,

Jan 6, 2008, 2:57:59 PM1/6/08

to

Kris Kennaway wrote:
> Maciej Suszko wrote:
> > Kris Kennaway wrote:

> >> Maciej Suszko wrote:
> >>> Kris Kennaway wrote:
> >>>> Ivan Voras wrote:

> >>>>> On 06/01/2008, Peter Schuller <peter.s...@infidyne.com>
> >>>>> wrote:
> >>>>>>> This number is not so large. It seems to be easily crashed by
> >>>>>>> rsync, for example (speaking from my own experience, and also
> >>>>>>> some of my colleagues).
> >>>>>> I can definitely say this is not *generally* true, as I do a
> >>>>>> lot of rsyncing/rdiff-backup:ing and similar stuff (with many
> >>>>>> files / large files) on ZFS without any stability issues.
> >>>>>> Problems for me have been limited to 32bit and the memory
> >>>>>> exhaustion issue rather than "hard" issues.
> >>>>> It's not generally true since kmem problems with rsync are often
> >>>>> hard to repeat - I have them on one machine, but not on another,
> >>>>> similar machine. This nonrepeatability is also a part of the
> >>>>> problem.
> >>>>>
> >>>>>> But perhaps that's all you are referring to.
> >>>>> Mostly. I did have a ZFS crash with rsync that wasn't kmem
> >>>>> related, but only once.
> >>>> kmem problems are just tuning. They are not indicative of
> >>>> stability problems in ZFS. Please report any further non-kmem
> >>>> panics you experience.

> >>> I agree that ZFS is pretty stable itself. I use 32bit machine with
> >>> 2gigs od RAM and all hang cases are kmem related, but the fact is
> >>> that I haven't found any way of tuning to stop it crashing. When I
> >>> do some rsyncing, especially beetwen different pools - it hangs or
> >>> reboots - mostly on bigger files (i.e. rsyncing ports tree with
> >>> distfiles). At the moment I patched the kernel with
> >>> vm_kern.c.2.patch and it just stopped crashing, but from time to
> >>> time the machine looks like beeing freezed for a second or two,
> >>> after that it works normally. Have you got any similar experience?
> >> That is expected. That patch makes the system do more work to try
> >> and reclaim memory when it would previously have panicked from lack
> >> of memory. However, the same advice applies as to Ivan: you should
> >> try and tune the memory parameters better to avoid this last-ditch
> >> sitation.
> >
> > As Ivan said - tuning kmem_size only delay the moment system crash,
> > earlier or after it happens - that's my point of view.
>

> So the same question applies: exactly what steps did you take to tune
> the memory parameters? Extracting this information from you guys
> shouldn't be as hard as this :)

I was playing around with kmem_max_size mainly. I suppose messing up
with KVA_PAGES is not a good idea unless you exactly know how much
memory you software consume...
--
regards, Maciej Suszko.

Claus Guttesen

unread,

Jan 6, 2008, 3:26:55 PM1/6/08

to

> >> Last I heard, rsync didn't crash Solaris on ZFS :)
> >
> > [Citation needed]
>
> I can't provide citation about a thing that doesn't happen - you don't
> hear things like "oh and yesterday I ran rsync on my Solaris with ZFS
> and *it didn't crash*!" often.
>
> But, with some grains of salt taken, consider this Google results:
>
> * searching for "rsync crash solaris zfs": 790 results, most of them
> obviously irrelevant
> * searching for "rsync crash freebsd zfs": 10,800 results; a small
> number of the results is from this thread, some are duplicates, but it's
> a large number in any case.
>
> I feel that the number of Solaris+ZFS installations worldwide is larger
> than that of FreeBSD+ZFS and they've had ZFS longer.

I used zfs on FreeBSD current amd64 around summer 2006 as a
samba-server for internal use on a dual xeon (first generation 64-bit,
somewhat slow and hot) with 4 GB ram and two qlogic hba's attached to
approx. 8 TB of storage. I did not once experience any kernel panic or
other unplanned stop. But I whenever I manually mounted a smbfs-share
the terminal would not return to the command line.

I upgraded in october 2007 and the smbfs-mount returned to the command
line and I thought I was happy. Until I started to get the kmem_map
too small kernel-panics when doing much I/O (syncing 40 GB of small
files). I tuned the values as indicated in the zfs tuning guide and
rebooted and increased the values as the kernel panics persisted. When
I increased the values even more I ended up with a kernel which
refused to boot, boy I was almost getting a panic myself :-)

Applying Pawel's patch did make the server survive two or three 40 GB
rsyncing so the patch did help. But we were approching xmas season
which is a very critical time for us so I migrated to solaris 10. The
solaris server has had no downtime but to conclude that solaris is
more stable in my situation is premature.

--
regards
Claus

When lenity and cruelty play for a kingdom,
the gentlest gamester is the soonest winner.

Shakespeare

Kris Kennaway

unread,

Jan 6, 2008, 4:43:27 PM1/6/08

to

Vadim Goncharov wrote:

> 06.01.08 @ 23:34 Kris Kennaway wrote:
>
>>> What is the other 512 MB of the 1 GB used for?
>>
>> Everything else that the kernel needs address space for. Buffer
>> cache, mbuf allocation, etc.
>

> Aren't they allocated from the same memory zones? I have a router with
> 256 Mb RAM, it had a panic with ng_nat once due to exhausted kmem. So,
> what these number from it's sysctl do really mean?
>
> vm.kmem_size: 83415040
> vm.kmem_size_max: 335544320
> vm.kmem_size_scale: 3
> vm.kvm_size: 1073737728
> vm.kvm_free: 704638976
>

I believe that mbufs are allocated from a separate map. In your case
you only have ~80MB available in your kmem_map, which is used for
malloc() in the kernel. It is possible that ng_nat in combination with
the other kernel malloc usage exhausted this relatively small amount of
space without mbuf use being a factor.

Kris

Robert Watson

unread,

Jan 6, 2008, 5:35:15 PM1/6/08

to

On Sun, 6 Jan 2008, Kris Kennaway wrote:

> Vadim Goncharov wrote:
>> 06.01.08 @ 23:34 Kris Kennaway wrote:
>>
>>>> What is the other 512 MB of the 1 GB used for?
>>>
>>> Everything else that the kernel needs address space for. Buffer cache,
>>> mbuf allocation, etc.
>>
>> Aren't they allocated from the same memory zones? I have a router with 256
>> Mb RAM, it had a panic with ng_nat once due to exhausted kmem. So, what
>> these number from it's sysctl do really mean?
>>
>> vm.kmem_size: 83415040
>> vm.kmem_size_max: 335544320
>> vm.kmem_size_scale: 3
>> vm.kvm_size: 1073737728
>> vm.kvm_free: 704638976
>
> I believe that mbufs are allocated from a separate map. In your case you
> only have ~80MB available in your kmem_map, which is used for malloc() in
> the kernel. It is possible that ng_nat in combination with the other kernel
> malloc usage exhausted this relatively small amount of space without mbuf
> use being a factor.

Actually, with mbuma, this has changed -- mbufs are now allocated from the
general kernel map. Pipe buffer memory and a few other things are still
allocated from separate maps, however. In fact, this was one of the known
issues with the introduction of large cluster sizes without resource limits:
address space and memory use were potentially unbounded, so Randall recently
properly implemented the resource limits on mbuf clusters of large sizes.

Robert N M Watson
Computer Laboratory
University of Cambridge

Ivan Voras

unread,

Jan 6, 2008, 5:51:08 PM1/6/08

to

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)

--------------enig157172261597D2CB0F24F254

Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Robert Watson wrote:

> Actually, with mbuma, this has changed -- mbufs are now allocated from =

> the general kernel map. Pipe buffer memory and a few other things are =

> still allocated from separate maps, however. In fact, this was one of =

> the known issues with the introduction of large cluster sizes without=20
> resource limits: address space and memory use were potentially=20
> unbounded, so Randall recently properly implemented the resource limits=
=20

> on mbuf clusters of large sizes.

Is this related to reported panics with ZFS and a heavy network load=20
(NFS mostly)?

--------------enig157172261597D2CB0F24F254

Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHgVn1ldnAQVacBcgRArzGAJ96QlxG3VnfACkWe7vvvQkqYc3rZQCgzCPP
K1Yl643Vws2q8emoyKqk/3Y=
=tLNf
-----END PGP SIGNATURE-----

--------------enig157172261597D2CB0F24F254--

Pawel Jakub Dawidek

unread,

Jan 7, 2008, 5:01:14 AM1/7/08

to

--VbJkn9YxBvnuCH5J
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 04, 2008 at 12:42:28PM +0100, Ivan Voras wrote:
> Hi,
>=20
> As far as I know about the details of implementation and what would it
> take to fix the problems, is it safe to assume ZFS will never become
> stable during 7.x lifetime?

To sum up this thread, let me present ZFS status as of today.

Before I do that, one explanation. I was away from FreeBSD for like 3-4
weeks, because of real life issues, etc. I hope, I'm now back for good.
Let me also use this again to invite any interested committers to help
working on ZFS (I'm inviting people to help from a day one).

Ok.

The most pressing issues currently are:
1. kmem_map exhaustion.
2. Low memory deadlocks in ZFS itself.

I believe 2nd problem is already fixed in OpenSolaris, at least that was
my impression when I made last integration, I need to double check. If
that's true, I'll try to commit the fix before 7.0-RELEASE.

The 1st problem has of course much wider audience. First of all you
need:

http://people.freebsd.org/~pjd/patches/vm_kern.c.2.patch

The patch is not yet committed, because I was discussing better
solutions with alc@. I don't think we (he) will be able to come up with
something better before 7.0-RELEASE, so I'm going to ask re@ for
approval for this patch today. Note that it is low risk change, because
it is executed only in situation where the system will panic anyway.

Of course it is so much better to use ZFS on 64bit systems, but it also
works on i386. I'm running ZFS in production for many months on two i386
systems. One has 1GB memory and those tunning in loader.conf:

vfs.zfs.prefetch_disable=3D1
vm.kmem_size=3D671088640
vm.kmem_size_max=3D671088640

I've three ZFS pools in here, no UFS at all. The load is rather light,
serving large files. No panics.

The second "production" box is my laptop. I've 2GB of RAM (it worked
fine with 1GB too), but I do have 'options KVA_PAGES=3D512' in my kernel
config and my loader.conf looks like this:

vm.kmem_size=3D1073741824
vm.kmem_size_max=3D1073741824
vfs.zfs.prefetch_disable=3D1

My laptop is ZFS-only. No panics whatsoever.

The box I'm running ZFS for the longest time is amd64 system with 1GB of
RAM. This box is used for backups (ZFS snapshots are so damn handy) and
guess what, I'm using rsync for backups:) It also serves files through
NFS:

beast:root:~# showmount -e | wc -l
31

ZFS is used heavly here:

beast:root:~# zfs list -t filesystem | wc -l
50
beast:root:~# zfs list -t snapshot | wc -l
1029

And loader.conf:

vm.kmem_size=3D629145600
vm.kmem_size_max=3D629145600

And again, rock stable.

All my ZFS systems use vm_kern.c.2.patch.

Of course all this doesn't mean ZFS works great on FreeBSD. No. It is
still an experimental feature. I don't agree we should deny mounting ZFS
on i386, etc. We can improve warning and even advise increasing
KVA_PAGES on i386. It's too late to increase vm.kmem_size by default, as
it can affect other parts of the system. ZFS also can't do it
automatically.

In my opinion people are panicing in this thread much more than ZFS:)
Let try to think how we can warn people clearly about proper tunning and
what proper tunning actually means. I think we should advise increasing
KVA_PAGES on i386 and not only vm.kmem_size. We could also warn that
running ZFS on 32bit systems is not generally recommended. Any other
suggestions?

--=20
Pawel Jakub Dawidek http://www.wheel.pl
p...@FreeBSD.org http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!

--VbJkn9YxBvnuCH5J
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.4 (FreeBSD)

iD8DBQFHgfgYForvXbEpPzQRAsTtAKCNMLpRQTwIGupmLeb3hYu+nLCNJQCfazFG
tmrC1xoyabhNW4qZNdSt3uI=
=W2E/
-----END PGP SIGNATURE-----

--VbJkn9YxBvnuCH5J--

Ivan Voras

unread,

Jan 7, 2008, 5:30:55 AM1/7/08

to

On 07/01/2008, Pawel Jakub Dawidek <p...@freebsd.org> wrote:

> Let try to think how we can warn people clearly about proper tunning and
> what proper tunning actually means. I think we should advise increasing
> KVA_PAGES on i386 and not only vm.kmem_size. We could also warn that
> running ZFS on 32bit systems is not generally recommended. Any other
> suggestions?

I'd suggest we do give all three warnings (KVA_PAGES, kmem_size, i386)
at once, preferably both when the ZFS module loads and when a zpool is
created. I think it's important that the tree pieces of information be
given at the same time so the user doesn't need to hunt solutions
after panics.

Your comment that people are panicking more than ZFS is correct, but
that illustrates the importance people give to having file system not
crash on them :)

Dag-Erling Smørgrav

unread,

Jan 7, 2008, 9:05:29 AM1/7/08

to

Ivan Voras <ivo...@freebsd.org> writes:
> As far as I know about the details of implementation and what would it
> take to fix the problems, is it safe to assume ZFS will never become
> stable during 7.x lifetime?

Have you heard of the logical fallacy called "plurium interrogationum"?
You may not be familiar with the phrase (which is Latin for "multiple
questions"), but it's what you're doing here: asking a question which is
impossible to answer truthfully because it is based on an incorrect
premise, and to answer the question correctly you must first discuss the
premise. It's a favorite Hollywood plot device, because you can have
the smart-aleck lawyer interrupt the confused witness and insist on a
yes or no answer, forcing the witness to implicitly agree with the
premise. I doubt it would work in a real-life court, though, because
judges tend to be smart people. But I digress.

Your question is based on the premise that ZFS in FreeBSD 7 is unstable.
That premise is false. There are issues with auto-tuning of certain
parameters, which can cause kmem exhaustion, but they are easily worked
around by setting a few tunables. It has worked very well for me
(raidz, 1.2 TB pool, 4 GB RAM, ~60 file systems and twice as many
snapshots) after I added the following lines to loader.conf:

vm.kmem_size=3D"1G"
vfs.zfs.arc_min=3D"64M"
vfs.zfs.arc_max=3D"512M"

DES
--=20
Dag-Erling Sm=C3=B8rgrav - d...@des.no

Vadim Goncharov

unread,

Jan 7, 2008, 10:18:10 AM1/7/08

to

07.01.08 @ 04:33 Robert Watson wrote:

> On Sun, 6 Jan 2008, Kris Kennaway wrote:
>
>> Vadim Goncharov wrote:
>>> 06.01.08 @ 23:34 Kris Kennaway wrote:
>>>
>>>>> What is the other 512 MB of the 1 GB used for?
>>>> Everything else that the kernel needs address space for. Buffer
>>>> cache, mbuf allocation, etc.
>>> Aren't they allocated from the same memory zones? I have a router
>>> with 256 Mb RAM, it had a panic with ng_nat once due to exhausted
>>> kmem. So, what these number from it's sysctl do really mean?
>>> vm.kmem_size: 83415040
>>> vm.kmem_size_max: 335544320
>>> vm.kmem_size_scale: 3
>>> vm.kvm_size: 1073737728
>>> vm.kvm_free: 704638976
>>
>> I believe that mbufs are allocated from a separate map. In your case
>> you only have ~80MB available in your kmem_map, which is used for
>> malloc() in the kernel. It is possible that ng_nat in combination with
>> the other kernel malloc usage exhausted this relatively small amount of
>> space without mbuf use being a factor.

Yes, in-kernel libalias is "leaking" in sense that it grows unbounded, and
uses malloc(9) instead if it's own UMA zone with settable limits (it frees
all used memory, however, on shutting down ng_nat, so I've done a
workaround restarting ng_nat nodes once a month). But as I see the panic
string:

panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated

and memory usage in crash dump:

router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
libalias 241127 30161K - 460568995 128
router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print
sum}'
50407

...so why only 50 Mb from 80 were used at the moment of panic?

BTW, current memory usage (April 6.2S, ipf w+ 2 ng_nat's) a week after
restart is low:

vadim@router:~>vmstat -m | grep alias
libalias 79542 9983K - 179493840 128
vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
28124

> Actually, with mbuma, this has changed -- mbufs are now allocated from

> the general kernel map. Pipe buffer memory and a few other things are

> still allocated from separate maps, however. In fact, this was one of

> the known issues with the introduction of large cluster sizes without

> resource limits: address space and memory use were potentially

> unbounded, so Randall recently properly implemented the resource limits

> on mbuf clusters of large sizes.

I still don't understand what that numbers from sysctl above do exactly
mean - sysctl -d for them is obscure. How many memory kernel uses in RAM,
and for which purposes? Is that limit constant? Does kernel swaps out
parts of it, and if yes, how many?

--
WBR, Vadim Goncharov

Robert Watson

unread,

Jan 7, 2008, 10:41:08 AM1/7/08

to

On Mon, 7 Jan 2008, Vadim Goncharov wrote:

> Yes, in-kernel libalias is "leaking" in sense that it grows unbounded, and
> uses malloc(9) instead if it's own UMA zone with settable limits (it frees
> all used memory, however, on shutting down ng_nat, so I've done a workaround
> restarting ng_nat nodes once a month). But as I see the panic string:

Did you have any luck raising interest from Paulo regarding this problem? Is
there a PR I can take a look at? I'm not really familiar with the code, so
I'd prefer someone who was a bit more familiar with it looked after it, but I
can certainly take a glance.

> panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated
>
> and memory usage in crash dump:
>
> router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
> libalias 241127 30161K - 460568995 128
> router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print sum}'
> 50407
>
> ...so why only 50 Mb from 80 were used at the moment of panic?

This is a bit complicated to answer, but I'll try to capture the gist in a
short space.

The kernel memory map is an address space in which pages can be placed to be
used by the kernel. Those pages are often allocated using one of two kernel
allocators, malloc(9) which does variable sized memory allocations, and uma(9)
which is a slab allocator and supports caching of complex but fixed-size
objects. Temporary buffers of variable size or infrequently allocated objects
will use malloc, but frequently allocated objects of fixed size (vnods, mbufs,
...) will use uma. "vmstat -m" prints out information on malloc allocations,
and "vmstat -z" prints out information on uma allocations.

To make life slightly more complicated, small malloc allocations are actually
implemented using uma -- there are a small number of small object size zones
reserved for this purpose, and malloc just rounds up to the next such bucket
size and allocations from that bucket. For larger sizes, malloc goes through
uma, but pretty much directly to VM which makes pages available directly. So
when you look at "vmstat -z" output, be aware that some of the information
presented there (zones named things like "128", "256", etc) are actually the
pools from which malloc allocations come, so there's double-counting.

There are also other ways to get memory into the kernel map, such as directly
inserting pages from user memory into the kernel address space in order to
implement zero-copy. This is done, for example, when zero-copy sockets are
used.

To make life just very slightly more complicated even, I'll tell you that
there are something called "submaps" in the kernel memory map, which have
special properties. One of these is used for mapping the buffer cache.
Another is used for mapping pageable memory used as part of copy-reduction in
the pipe(2) code. Rather than copying twice (into the kernel and out again)
in the pipe code, for large pipe I/O we will borrow the user pages from the
sending process, mapping them into the kernel and hooking them up to the pipe.

> BTW, current memory usage (April 6.2S, ipf w+ 2 ng_nat's) a week after
> restart is low:
>
> vadim@router:~>vmstat -m | grep alias
> libalias 79542 9983K - 179493840 128
> vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
> 28124
>
>> Actually, with mbuma, this has changed -- mbufs are now allocated from the
>> general kernel map. Pipe buffer memory and a few other things are still
>> allocated from separate maps, however. In fact, this was one of the known
>> issues with the introduction of large cluster sizes without resource
>> limits: address space and memory use were potentially unbounded, so Randall
>> recently properly implemented the resource limits on mbuf clusters of large
>> sizes.
>
> I still don't understand what that numbers from sysctl above do exactly mean
> - sysctl -d for them is obscure. How many memory kernel uses in RAM, and for
> which purposes? Is that limit constant? Does kernel swaps out parts of it,
> and if yes, how many?

The concept of kernel memory, as seen above, is a bit of a convoluted concept.
Simple memory allocated by the kernel for its internal data structures, such
as vnodes, sockets, mbufs, etc, is almost always not something that can be
paged, as it may be accessed from contexts where blocking on I/O is not
permitted (for example, in interrupt threads or with critical mutexes held).
However, other memory in the kernel map may well be pageable, such as kernel
thread stacks for sleeping user threads (which can be swapped out under heavy
memory load), pipe buffers, and general cached data for the buffer cache /
file system, which will be paged out or discarded when memory pressure goes
up.

When debugging a kernel memory leak in the network stack, the usual starting
point is to look at vmstat -m and vmstat -z to see what type of memory is
being leaked. The really big monotonically growing type is usually the one
that's at fault. Often it's the one being allocated when the system runs out
of address space or memory, so sometimes even a simple backtrace will identify
the culprit.

Robert N M Watson
Computer Laboratory
University of Cambridge

Andrew Thompson

unread,

Jan 7, 2008, 3:47:34 PM1/7/08

to

On Mon, Jan 07, 2008 at 03:37:26PM +0100, Ivan Voras wrote:

> On 07/01/2008, Dag-Erling Sm??rgrav <d...@des.no> wrote:
>
> > Your question is based on the premise that ZFS in FreeBSD 7 is unstable.
> > That premise is false.
>

> At most, we'll have to agree to disagree. A "tuning" of the system (at
> least from my experience) is about system performance, not whether the
> system will crash or not. You may define the word to mean something
> else but that's your thing.
>
> The reason I'm aggressively discussing this is that labeling the
> problem as "tuning" will, for any non-trivial task which has some
> growth in system load, result in a server that needs constant tuning
> just to survive another day. What is tuned today may as well result in
> a crash tomorrow if the load rises. Web servers are notorious for this
> (though other types have of course similar behaviour) - a
> "slashdotting" of a "properly tuned" FreeBSD system with ZFS will not
> result in a slowdown - it will result in the system crashing. This is
> not acceptable, and therefore dismissing it as "just tuning" is
> counterproductive and bad engineering.

ZFS is clearly marked as experimental so its reasonable to require tuning
to avoid crashes. If its still the case when the experimental status is
lifted then you can have this argument all over again.

cheers,
Andrew

Vadim Goncharov

unread,

Jan 7, 2008, 6:29:31 PM1/7/08

to

07.01.08 @ 21:39 Robert Watson wrote:

> On Mon, 7 Jan 2008, Vadim Goncharov wrote:
>
>> Yes, in-kernel libalias is "leaking" in sense that it grows unbounded,
>> and uses malloc(9) instead if it's own UMA zone with settable limits
>> (it frees all used memory, however, on shutting down ng_nat, so I've
>> done a workaround restarting ng_nat nodes once a month). But as I see
>> the panic string:
>
> Did you have any luck raising interest from Paulo regarding this
> problem? Is there a PR I can take a look at? I'm not really familiar
> with the code, so I'd prefer someone who was a bit more familiar with it
> looked after it, but I can certainly take a glance.

No, i didn't do that yet. Brief search, however, shows kern/118432, though
it is not directly kmem issue, and also thread
http://209.85.135.104/search?q=cache:lpXLlrtojg8J:archive.netbsd.se/%3Fml%3Dfreebsd-net%26a%3D2006-10%26t%3D2449333+ng_nat+panic+memory&hl=ru&ct=clnk&cd=9&client=opera
in which memory exhaustion problem was predicted. Also, I've heard some
rumors about ng_nat memory panics under very heavy load, but a man with
300Mbps router with several ng_nat's said his router is rock stable for
half a year - though his router has 1 Gb of RAM and mine only 256 Mb (BTW,
it's his system that has crashed recently with kern/118993, but this is
not ng_nat kmem issue, as I think).

>> panic: kmem_malloc(16384): kmem_map too small: 83415040 total allocated
>>
>> and memory usage in crash dump:
>>
>> router:~# vmstat -m -M /var/crash/vmcore.32 | grep alias
>> libalias 241127 30161K - 460568995 128
>> router:~# vmstat -m -M /var/crash/vmcore.32 | awk '{sum+=$3} END {print
>> sum}'
>> 50407
>>
>> ...so why only 50 Mb from 80 were used at the moment of panic?
>
> This is a bit complicated to answer, but I'll try to capture the gist in
> a short space.
>
> The kernel memory map is an address space in which pages can be placed
> to be used by the kernel. Those pages are often allocated using one of
> two kernel allocators, malloc(9) which does variable sized memory
> allocations, and uma(9) which is a slab allocator and supports caching
> of complex but fixed-size objects. Temporary buffers of variable size
> or infrequently allocated objects will use malloc, but frequently

> allocated objects of fixed size (vnods, mbufs, ....) will use uma.

> "vmstat -m" prints out information on malloc allocations, and "vmstat
> -z" prints out information on uma allocations.
>
> To make life slightly more complicated, small malloc allocations are
> actually implemented using uma -- there are a small number of small
> object size zones reserved for this purpose, and malloc just rounds up
> to the next such bucket size and allocations from that bucket. For
> larger sizes, malloc goes through uma, but pretty much directly to VM
> which makes pages available directly. So when you look at "vmstat -z"
> output, be aware that some of the information presented there (zones
> named things like "128", "256", etc) are actually the pools from which
> malloc allocations come, so there's double-counting.

Yes, I've known it, but didn't known what column names exactly mean.
Requests/Failures, I guess, is a pure statistics, Size is one element
size, but why USED + FREE != LIMIT (on whose where limit is non-zero) ?

> There are also other ways to get memory into the kernel map, such as
> directly inserting pages from user memory into the kernel address space
> in order to implement zero-copy. This is done, for example, when
> zero-copy sockets are used.

Last time I've tried it on 5.4 it caused panics every several hours on my
fileserver, so I thought this feature is not of wide use...

> To make life just very slightly more complicated even, I'll tell you
> that there are something called "submaps" in the kernel memory map,
> which have special properties. One of these is used for mapping the
> buffer cache. Another is used for mapping pageable memory used as part
> of copy-reduction in the pipe(2) code. Rather than copying twice (into
> the kernel and out again) in the pipe code, for large pipe I/O we will
> borrow the user pages from the sending process, mapping them into the
> kernel and hooking them up to the pipe.

So, is the kernel memory map global thing that covers entire kernel or
there several maps in kernel, say, one for malloc(), one for other UMA,
etc. ? Recalling sysctl values from my previous message:

vm.kmem_size: 83415040
vm.kmem_size_max: 335544320
vm.kmem_size_scale: 3
vm.kvm_size: 1073737728
vm.kvm_free: 704638976

So, kvm_size looks like amount of KVA_PAGES, covering entire kernel
address space, plugged to every process' address space. But more than 300
megs are used, while machine has only 256 Mb of RAM. I see line in top:

Mem: 41M Active, 1268K Inact, 102M Wired, 34M Buf, 94M Free

I guess 34M buffer cache is entirely in-kernel memory, is this part of
kmem_size or another part of kernel space? What does kmem_size_max and
kmem_size_scale do - can kmem grow dynamically? Does kmem_size of about 80
megs mean that 80 megs of RAM is constantly used by kernel for it's needs,
including buffer cache, and other 176 megs are spent for processes RSS, or
relation is more complicated?

>> BTW, current memory usage (April 6.2S, ipf w+ 2 ng_nat's) a week after
>> restart is low:
>>
>> vadim@router:~>vmstat -m | grep alias
>> libalias 79542 9983K - 179493840 128
>> vadim@router:~>vmstat -m | awk '{sum+=$3} END {print sum}'
>> 28124
>>
>>> Actually, with mbuma, this has changed -- mbufs are now allocated from
>>> the general kernel map. Pipe buffer memory and a few other things are
>>> still allocated from separate maps, however. In fact, this was one of
>>> the known issues with the introduction of large cluster sizes without
>>> resource limits: address space and memory use were potentially
>>> unbounded, so Randall recently properly implemented the resource
>>> limits on mbuf clusters of large sizes.
>>
>> I still don't understand what that numbers from sysctl above do exactly
>> mean - sysctl -d for them is obscure. How many memory kernel uses in
>> RAM, and for which purposes? Is that limit constant? Does kernel swaps
>> out parts of it, and if yes, how many?
>
> The concept of kernel memory, as seen above, is a bit of a convoluted
> concept. Simple memory allocated by the kernel for its internal data
> structures, such as vnodes, sockets, mbufs, etc, is almost always not
> something that can be paged, as it may be accessed from contexts where
> blocking on I/O is not permitted (for example, in interrupt threads or
> with critical mutexes held). However, other memory in the kernel map may
> well be pageable, such as kernel thread stacks for sleeping user threads

We can assume for simplicty that their memoru is not-so-kernel but part of
process address space :)

> (which can be swapped out under heavy memory load), pipe buffers, and
> general cached data for the buffer cache / file system, which will be
> paged out or discarded when memory pressure goes up.

Umm. I think there is no point in swapping disk cache which can be
discarded, so the most actual part of kernel memory which is swappable are
anonymous pipe(2) buffers?

> When debugging a kernel memory leak in the network stack, the usual
> starting point is to look at vmstat -m and vmstat -z to see what type of
> memory is being leaked. The really big monotonically growing type is
> usually the one that's at fault. Often it's the one being allocated
> when the system runs out of address space or memory, so sometimes even a
> simple backtrace will identify the culprit.

OK, here are the zone state from the crash dump:

router:~# vmstat -z -M /var/crash/vmcore.32
ITEM SIZE LIMIT USED FREE REQUESTS
FAILURES

UMA Kegs: 140, 0, 88, 8,
88, 0
UMA Zones: 120, 0, 88, 2,
88, 0
UMA Slabs: 64, 0, 5020, 54,
15454953, 0
UMA RCntSlabs: 104, 0, 1500, 165,
1443452, 0
UMA Hash: 128, 0, 3, 27,
6, 0
16 Bucket: 76, 0, 19, 31,
34, 0
32 Bucket: 140, 0, 24, 4,
58, 0
64 Bucket: 268, 0, 14, 28,
125, 177
128 Bucket: 524, 0, 449, 97, 415988,
109049
VM OBJECT: 132, 0, 2124, 13217,
37014938, 0
MAP: 192, 0, 7, 33,
7, 0
KMAP ENTRY: 68, 15512, 24, 2440,
67460011, 0
MAP ENTRY: 68, 0, 1141, 483,
67039931, 0
PV ENTRY: 24, 452400, 25801, 23499,
784683549, 0
DP fakepg: 72, 0, 0, 0,
0, 0
mt_zone: 64, 0, 237, 58,
237, 0
16: 16, 0, 2691, 354,
21894973014, 0
32: 32, 0, 2281, 318,
35838274034, 0
64: 64, 0, 6098, 1454,
172769061, 0
128: 128, 0, 243914, 16846,
637135440, 4
256: 256, 0, 978, 222,
134799637, 0
512: 512, 0, 196, 116,
3216246, 0
1024: 1024, 0, 67, 73,
366070, 0
2048: 2048, 0, 8988, 46,
69855367, 7
4096: 4096, 0, 155, 29,
1894695, 0
Files: 72, 0, 270, 207,
31790371, 0
PROC: 536, 0, 96, 37,
1567418, 0
THREAD: 376, 0, 142, 8,
14326845, 0
KSEGRP: 88, 0, 137, 63,
662, 0
UPCALL: 44, 0, 6, 150,
536, 0
VMSPACE: 296, 0, 48, 56,
1567372, 0
audit_record: 828, 0, 0, 0,
0, 0
mbuf_packet: 256, 0, 591, 121,
208413611538, 0
mbuf: 256, 0, 1902, 1226,
202203273445, 0
mbuf_cluster: 2048, 8768, 2537, 463,
5247493815, 2
mbuf_jumbo_pagesize: 4096, 0, 0, 0,
0, 0
mbuf_jumbo_9k: 9216, 0, 0, 0,
0, 0
mbuf_jumbo_16k: 16384, 0, 0, 0,
0, 0
ACL UMA zone: 388, 0, 0, 0,
0, 0
NetGraph items: 36, 546, 0, 546,
251943928450, 1170428
g_bio: 132, 0, 1, 231,
336628343, 0
ata_request: 204, 0, 1, 316,
82269680, 0
ata_composite: 196, 0, 0, 0,
0, 0
VNODE: 272, 0, 2039, 14523,
40154724, 0
VNODEPOLL: 76, 0, 0, 50,
1, 0
S VFS Cache: 68, 0, 2247, 12929,
41383752, 0
L VFS Cache: 291, 0, 0, 364,
536802, 0
NAMEI: 1024, 0, 372, 12,
126634007, 0
NFSMOUNT: 480, 0, 0, 0,
0, 0
NFSNODE: 460, 0, 0, 0,
0, 0
DIRHASH: 1024, 0, 156, 184,
131252, 0
PIPE: 408, 0, 24, 30,
822603, 0
KNOTE: 68, 0, 0, 112,
249530, 0
bridge_rtnode: 32, 0, 0, 0,
0, 0
socket: 356, 8778, 75, 35,
1488596, 0
ipq: 32, 339, 0, 226,
58472202, 0
udpcb: 180, 8778, 17, 49,
239035, 0
inpcb: 180, 8778, 23, 109,
676919, 0
tcpcb: 464, 8768, 22, 34,
676919, 0
tcptw: 48, 1794, 1, 233,
177851, 0
syncache: 100, 15366, 0, 78,
610893, 0
hostcache: 76, 15400, 78, 72,
13137, 0
tcpreass: 20, 676, 0, 169,
48826, 0
sackhole: 20, 0, 0, 169,
194, 0
ripcb: 180, 8778, 4, 40,
142316, 0
unpcb: 144, 8775, 19, 62,
393432, 0
rtentry: 132, 0, 480, 187,
448160, 0
pfsrctrpl: 100, 0, 0, 0,
0, 0
pfrulepl: 604, 0, 0, 0,
0, 0
pfstatepl: 260, 10005, 0, 0,
0, 0
pfaltqpl: 128, 0, 0, 0,
0, 0
pfpooladdrpl: 68, 0, 0, 0,
0, 0
pfrktable: 1240, 0, 0, 0,
0, 0
pfrkentry: 156, 0, 0, 0,
0, 0
pfrkentry2: 156, 0, 0, 0,
0, 0
pffrent: 16, 5075, 0, 0,
0, 0
pffrag: 48, 0, 0, 0,
0, 0
pffrcache: 48, 10062, 0, 0,
0, 0
pffrcent: 12, 50141, 0, 0,
0, 0
pfstatescrub: 28, 0, 0, 0,
0, 0
pfiaddrpl: 92, 0, 0, 0,
0, 0
pfospfen: 108, 0, 0, 0,
0, 0
pfosfp: 28, 0, 0, 0,
0, 0
IPFW dynamic rule zone: 108, 0, 147, 393,
20301589, 0
divcb: 180, 8778, 2, 42,
45, 0
SWAPMETA: 276, 30548, 2257, 473,
348836, 0
Mountpoints: 664, 0, 8, 10,
100, 0
FFS inode: 132, 0, 2000, 6468,
40152792, 0
FFS1 dinode: 128, 0, 0, 0,
0, 0
FFS2 dinode: 256, 0, 2000, 3730,
40152792, 0

--
WBR, Vadim Goncharov

Robert Watson

unread,

Jan 7, 2008, 6:40:10 PM1/7/08

to

On Tue, 8 Jan 2008, Vadim Goncharov wrote:

>> To make life slightly more complicated, small malloc allocations are
>> actually implemented using uma -- there are a small number of small object
>> size zones reserved for this purpose, and malloc just rounds up to the next
>> such bucket size and allocations from that bucket. For larger sizes,
>> malloc goes through uma, but pretty much directly to VM which makes pages
>> available directly. So when you look at "vmstat -z" output, be aware that
>> some of the information presented there (zones named things like "128",
>> "256", etc) are actually the pools from which malloc allocations come, so
>> there's double-counting.
>
> Yes, I've known it, but didn't known what column names exactly mean.
> Requests/Failures, I guess, is a pure statistics, Size is one element size,
> but why USED + FREE != LIMIT (on whose where limit is non-zero) ?

Possibly we should rename the "FREE" column to "CACHE" -- the free count is
the number of items in the UMA cache. These may be hung in buckets off the
per-CPU cache, or be spare buckets in the zone. Either way, the memory has to
be reclaimed before it can be used for other purposes, and generally for
complex objects, it can be allocated much more quickly than going back to VM
for more memory. LIMIT is an administrative limit that may be configured on
the zone, and is configured for some but not all zones.

I'll let someone with a bit more VM experience follow up with more information
about how the various maps and submaps relate to each other.

>> The concept of kernel memory, as seen above, is a bit of a convoluted
>> concept. Simple memory allocated by the kernel for its internal data
>> structures, such as vnodes, sockets, mbufs, etc, is almost always not
>> something that can be paged, as it may be accessed from contexts where
>> blocking on I/O is not permitted (for example, in interrupt threads or with
>> critical mutexes held). However, other memory in the kernel map may well be
>> pageable, such as kernel thread stacks for sleeping user threads
>
> We can assume for simplicty that their memoru is not-so-kernel but part of
> process address space :)

If it is mapped into the kernel address space, then it still counts towards
the limit on the map. There are really two critical resources: memory itself,
and address space to map it into. Over time, the balance between address
space and memory changes -- for a long time, 32 bits was the 640k of the UNIX
world, so there was always plenty of address space and not enough memory to
fill it. More recently, physical memory started to overtake address space,
and now with the advent of widely available 64-bit systems, it's swinging in
the other direction. The trick is always in how to tune things, as tuning
parameters designed for "memory is bounded and address space is infinite"
often work less well when that's not the case. In the early 5.x series, we
had a lot of kernel panics because kernel constants were scaling to physical
memory rather than address space, so the kernel would run out of address
space, for example.

>> (which can be swapped out under heavy memory load), pipe buffers, and
>> general cached data for the buffer cache / file system, which will be paged
>> out or discarded when memory pressure goes up.
>
> Umm. I think there is no point in swapping disk cache which can be
> discarded, so the most actual part of kernel memory which is swappable are
> anonymous pipe(2) buffers?

Yes, that's what I meant. There are some other types of pageable kernel
memory, such as memory used for swap-backed md devices.

Robert N M Watson
Computer Laboratory
University of Cambridge

Robert Watson

unread,

Jan 8, 2008, 4:21:11 AM1/8/08

to

On Sun, 6 Jan 2008, Ivan Voras wrote:

> Robert Watson wrote:
>
>> Actually, with mbuma, this has changed -- mbufs are now allocated from the
>> general kernel map. Pipe buffer memory and a few other things are still
>> allocated from separate maps, however. In fact, this was one of the known
>> issues with the introduction of large cluster sizes without resource
>> limits: address space and memory use were potentially unbounded, so Randall
>> recently properly implemented the resource limits on mbuf clusters of large
>> sizes.
>

> Is this related to reported panics with ZFS and a heavy network load (NFS
> mostly)?

Handling resource exhaustion is a tricky issue, because sometimes it takes
resources to make resources available. In the presence of a really greedy
(that is to say, effectively leaking) subsystem, there isn't really any way to
recover. There are really two alternatives: deadlock (no resources are
available, so no progress can be made) or panic (no resources are available so
do the only thing we can). Subsystems are relied upon to impose their own
limits, or at least provide those limits to UMA so that UMA can impose them,
as "appropriate" limits are entirely dependent on context. It's indeed the
case that the more load the system is under, the more resources are in use,
and therefore the lower the threshold for any particular system to contribute
to a potential exhaustion of resources. If the network is at a very high
watermark, then indeed ZFS has to use less to exhaust it.

Normally, subsystems like the network stack and file systems rely on "back
pressure" to cause them to release memory -- the network stack largely
allocates using UMA, so the VM low memory event frees up its caches, and it
also implements its own per-protocol low memory handlers, doing things like
discarding TCP reassembly buffers, etc. VM also knows to discard un-dirtied
pages. Pawel has a patch to make ZFS more agressively call low memory event
handlers when it gets a bit too greedy, which I saw in the re@ MFC queue
yesterday, it you might find this improves behavior a bit more. However,
things do get quite tricky when you're low on resources, because you waiting
indefinitely for resources rather than panicking may actually be worse,
because the system may never recover. That's why constaining initial resource
and responding to back pressure early is critical, in order to avoid getting
into situations where the only possible response is to hang or panic.

There's an interesting paper by Gibson, et al, from CMU on economic models for
"investing" memory pages in different sorts of cache -- prefetch, read-ahead,
buffer cache, etc, and is a good read for getting a grasp of just how tricky
the balance is to find.

Oliver Fromme

unread,

Jan 8, 2008, 12:59:40 PM1/8/08

to

Ivan Voras wrote:

> Pawel Jakub Dawidek wrote:
>
> > Let try to think how we can warn people clearly about proper tunning and
> > what proper tunning actually means. I think we should advise increasing
> > KVA_PAGES on i386 and not only vm.kmem_size. We could also warn that
> > running ZFS on 32bit systems is not generally recommended. Any other
> > suggestions?
>
> I'd suggest we do give all three warnings (KVA_PAGES, kmem_size, i386)
> at once, preferably both when the ZFS module loads and when a zpool is
> created. I think it's important that the tree pieces of information be
> given at the same time so the user doesn't need to hunt solutions
> after panics.

How about including the URL of the ZFS tuning guide in the
warning message:

http://wiki.freebsd.org/ZFSTuningGuide

It contains all the necessary information for both i386 and
amd64 machines. It can also easily be updated if necessary
so people always get the most up-to-date information.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd

"Documentation is like sex; when it's good, it's very, very good,
and when it's bad, it's better than nothing."
-- Dick Brandon

Steve Kargl

unread,

Jan 8, 2008, 1:05:04 PM1/8/08

to

On Tue, Jan 08, 2008 at 06:58:47PM +0100, Oliver Fromme wrote:
> Ivan Voras wrote:
> > Pawel Jakub Dawidek wrote:
> >
> > > Let try to think how we can warn people clearly about proper tunning and
> > > what proper tunning actually means. I think we should advise increasing
> > > KVA_PAGES on i386 and not only vm.kmem_size. We could also warn that
> > > running ZFS on 32bit systems is not generally recommended. Any other
> > > suggestions?
> >
> > I'd suggest we do give all three warnings (KVA_PAGES, kmem_size, i386)
> > at once, preferably both when the ZFS module loads and when a zpool is
> > created. I think it's important that the tree pieces of information be
> > given at the same time so the user doesn't need to hunt solutions
> > after panics.
>
> How about including the URL of the ZFS tuning guide in the
> warning message:
>
> http://wiki.freebsd.org/ZFSTuningGuide
>
> It contains all the necessary information for both i386 and
> amd64 machines. It can also easily be updated if necessary
> so people always get the most up-to-date information.
>

The tuning information belongs in the zfs(8) manual page.

--
Steve

Vadim Goncharov

unread,

Jan 8, 2008, 1:59:53 PM1/8/08

to

08.01.08 @ 05:39 Robert Watson wrote:

> On Tue, 8 Jan 2008, Vadim Goncharov wrote:
>
>>> To make life slightly more complicated, small malloc allocations are
>>> actually implemented using uma -- there are a small number of small
>>> object size zones reserved for this purpose, and malloc just rounds up
>>> to the next such bucket size and allocations from that bucket. For
>>> larger sizes, malloc goes through uma, but pretty much directly to VM
>>> which makes pages available directly. So when you look at "vmstat -z"
>>> output, be aware that some of the information presented there (zones
>>> named things like "128", "256", etc) are actually the pools from which
>>> malloc allocations come, so there's double-counting.
>>
>> Yes, I've known it, but didn't known what column names exactly mean.
>> Requests/Failures, I guess, is a pure statistics, Size is one element
>> size, but why USED + FREE != LIMIT (on whose where limit is non-zero) ?
>
> Possibly we should rename the "FREE" column to "CACHE" -- the free count
> is the number of items in the UMA cache. These may be hung in buckets
> off the per-CPU cache, or be spare buckets in the zone. Either way, the
> memory has to be reclaimed before it can be used for other purposes, and
> generally for complex objects, it can be allocated much more quickly
> than going back to VM for more memory. LIMIT is an administrative limit
> that may be configured on the zone, and is configured for some but not
> all zones.

And every unlimited zone after growing on demand can cause
kmem_map/kmem_size panics, or some will low-memeory panics with message
about another map?

> I'll let someone with a bit more VM experience follow up with more
> information about how the various maps and submaps relate to each other.

That would be good, as I still don'tany idea about exact meaning of those
sysctls :-) Thans for explanations, though. How is our Mr. VM nowadays?..

>>> (which can be swapped out under heavy memory load), pipe buffers, and
>>> general cached data for the buffer cache / file system, which will be
>>> paged out or discarded when memory pressure goes up.
>>
>> Umm. I think there is no point in swapping disk cache which can be
>> discarded, so the most actual part of kernel memory which is swappable
>> are anonymous pipe(2) buffers?
>
> Yes, that's what I meant. There are some other types of pageable kernel
> memory, such as memory used for swap-backed md devices.

Hmm, I do remember messages about malloc-backed md devices panics (with
workaround advices to switch to swap-backed md), yes...

--
WBR, Vadim Goncharov

Robert Watson

unread,

Jan 8, 2008, 2:23:29 PM1/8/08

to

On Wed, 9 Jan 2008, Vadim Goncharov wrote:

>>> Yes, I've known it, but didn't known what column names exactly mean.
>>> Requests/Failures, I guess, is a pure statistics, Size is one element
>>> size, but why USED + FREE != LIMIT (on whose where limit is non-zero) ?
>>
>> Possibly we should rename the "FREE" column to "CACHE" -- the free count is
>> the number of items in the UMA cache. These may be hung in buckets off the
>> per-CPU cache, or be spare buckets in the zone. Either way, the memory has
>> to be reclaimed before it can be used for other purposes, and generally for
>> complex objects, it can be allocated much more quickly than going back to
>> VM for more memory. LIMIT is an administrative limit that may be
>> configured on the zone, and is configured for some but not all zones.
>
> And every unlimited zone after growing on demand can cause
> kmem_map/kmem_size panics, or some will low-memeory panics with message
> about another map?

Well, there are also limits not imposed using the UMA limit mechanism, so just
because it appears unbounded in vmstat -z doesn't mean there's no limit.
There's no UMA zone limit on processes, but there's a separately imposed
maxproc limit--and as a result, filedesc, which is typically one per process,
is also bounded to approximately maxproc. Likewise, many other data
structures effectively scale with the number of processes, the size of
physical memory, the size of the address space, maxusers, etc.

There are relatively few things that actually have no limit associated with
them one way or another, precisely because if there's no limit it can lead the
kernel to become starved of resources. Where there isn't a limit, ideally
privilege is required to allocate (i.e., malloc-backed swap requires root
privilege to configure). Sometimes the limits are much more complex than a
single global limit, such as resources controlled using resource limits, which
can be per-process, per-uid, etc.

Robert N M Watson
Computer Laboratory
University of Cambridge

Dan Nelson

unread,

Jan 9, 2008, 12:46:46 AM1/9/08

to

In the last episode (Jan 09), Ivan Voras said:
> On 08/01/2008, Dag-Erling Smorgrav <d...@des.no> wrote:
> > Actually, it fails to mention the most important bit:
> > vfs.zfs.arc_max, which allows you to restrict the amount of memory
> > used by ZFS to something comfortably smaller than vm.kmem_size.
>
> Pawel, is it recommended?
>
> If it is, I'll add it to the page.

With the vm_kern.c.2.patch, it doesn't seem to be an issue, at least
for me. "c" always stays far away from "c_max":

kstat.zfs.misc.arcstats.p: 218885440
kstat.zfs.misc.arcstats.c: 342346436
kstat.zfs.misc.arcstats.c_min: 20971520
kstat.zfs.misc.arcstats.c_max: 503316480
kstat.zfs.misc.arcstats.size: 342342144
vm.kmem_size: 671088640
hw.physmem: 1064771584
vm.kmem_map_panics_avoided: 171

The last sysctl was added by me to track how often the patch saved my
system from a panic :) I suppose lowering arc_max would reduce the
number of times the routine was called, though.

--
Dan Nelson
dne...@allantgroup.com

Scot Hetzel

unread,

Jan 9, 2008, 12:40:51 PM1/9/08

to

On 1/8/08, Dag-Erling Sm=F8rgrav <d...@des.no> wrote:

> Oliver Fromme <ol...@lurza.secnetix.de> writes:
> > How about including the URL of the ZFS tuning guide in the
> > warning message:
> >
> > http://wiki.freebsd.org/ZFSTuningGuide
> >
> > It contains all the necessary information for both i386 and
> > amd64 machines.
>

> Actually, it fails to mention the most important bit: vfs.zfs.arc_max,
> which allows you to restrict the amount of memory used by ZFS to
> something comfortably smaller than vm.kmem_size.
>

It was in the ZFS tunning guide, but was removed in revision 20.
Doesn't say why the change was made.

Scot

Mark Powell

unread,

Jan 12, 2008, 5:14:11 PM1/12/08

to

On Tue, 8 Jan 2008, Oliver Fromme wrote:

> http://wiki.freebsd.org/ZFSTuningGuide
>
> It contains all the necessary information for both i386 and amd64

> machines. It can also easily be updated if necessary so people always
> get the most up-to-date information.

Pawel said in Nov:

-----
> kern.maxvnodes: 400000

The Wiki should be changed. Allow ZFS to autotune it, don't tune it by
hand.
-----

Yet the wiki still recommends hand tuning?
Cheers.

--
Mark Powell - UNIX System Administrator - The University of Salford
Information Services Division, Clifford Whitworth Building,
Salford University, Manchester, M5 4WT, UK.
Tel: +44 161 295 6843 Fax: +44 161 295 5888 www.pgp.com for PGP key