Hello,
In order to avoid having to type everything again, I'll refer
to the commit log. PLEASE READ IT IN FULL:
Bring in mbuma to replace mballoc.
mbuma is an Mbuf & Cluster allocator built on top of a number of
extensions to the UMA framework, all included herein.
Extensions to UMA worth noting:
- Better layering between slab <-> zone caches; introduce
Keg structure which splits off slab cache away from the
zone structure and allows multiple zones to be stacked
on top of a single Keg (single type of slab cache);
perhaps we should look into defining a subset API on
top of the Keg for special use by malloc(9),
for example.
- UMA_ZONE_REFCNT zones can now be added, and reference
counters automagically allocated for them within the end
of the associated slab structures. uma_find_refcnt()
does a kextract to fetch the slab struct reference from
the underlying page, and lookup the corresponding refcnt.
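
The Keg/zone split described above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the struct and function names (keg, zone, zone_alloc, zone_free) are hypothetical and are not the kernel's UMA API, and malloc(3) stands in for the slab layer. It shows just the one idea that several zones can share a single backing item cache and its limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; these are NOT the kernel's UMA types. */
struct keg {
    size_t item_size;   /* size of every item carved from this keg */
    size_t allocated;   /* items currently handed out */
    size_t limit;       /* maximum items outstanding; 0 means unlimited */
};

struct zone {
    struct keg *keg;            /* backing cache, possibly shared */
    void (*ctor)(void *item);   /* per-zone initialisation, may be NULL */
};

/* Allocate one item through a zone; the limit is enforced on the keg. */
void *zone_alloc(struct zone *z)
{
    struct keg *k = z->keg;
    void *item;

    if (k->limit != 0 && k->allocated >= k->limit)
        return NULL;
    item = malloc(k->item_size);    /* stand-in for the slab layer */
    if (item == NULL)
        return NULL;
    k->allocated++;
    if (z->ctor != NULL)
        z->ctor(item);
    return item;
}

/* Return an item; accounting goes back to the shared keg. */
void zone_free(struct zone *z, void *item)
{
    assert(z->keg->allocated > 0);
    z->keg->allocated--;
    free(item);
}
```

In this model, two zones pointing at the same keg draw from one budget: once the keg's limit is reached through either zone, allocations through both fail until something is freed.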
mbuma things worth noting:
- integrates mbuf & cluster allocations with extended UMA
and provides caches for commonly-allocated items; defines
several zones (two primary, one secondary) and two kegs.
- change up certain code paths that always used to do:
m_get() + m_clget() to instead just use m_getcl() and
try to take advantage of the newly defined secondary
Packet zone.
- netstat(1) and systat(1) quickly hacked up to do basic
stat reporting but additional stats work needs to be
done once some other details within UMA have been taken
care of and it becomes clearer how stats will work
within the modified framework.
From the user perspective, one implication is that the
NMBCLUSTERS compile-time option is no longer used. The
maximum number of clusters is still capped off according
to maxusers, but it can be made unlimited by setting
the kern.ipc.nmbclusters boot-time tunable to zero.
Work should be done to write an appropriate sysctl
handler allowing dynamic tuning of kern.ipc.nmbclusters
at runtime.
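
As a minimal illustration of the unlimited setting mentioned above, assuming the tunable is set in the usual place for boot-time tunables, /boot/loader.conf:

```
# /boot/loader.conf
# Assumption: a value of zero lifts the cluster cap, as described above.
kern.ipc.nmbclusters="0"
```

The change takes effect at the next boot, since the log above notes there is no runtime sysctl handler yet.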
Additional things worth noting/known issues (READ):
- One report of 'ips' (ServeRAID) driver acting really
slow in conjunction with mbuma. Need more data.
Latest report is that ips is equally sucking with
and without mbuma.
- Giant leak in NFS code sometimes occurs, can't
reproduce but currently analyzing; brueffer is
able to reproduce but THIS IS NOT an mbuma-specific
problem and currently occurs even WITHOUT mbuma.
- Issues in network locking: there is at least one
code path in the rip code where one or more locks
are acquired and we end up in m_prepend() with
M_WAITOK, which causes WITNESS to whine from within
UMA. Current temporary solution: force all UMA
allocations to be M_NOWAIT from within UMA for now
to avoid deadlocks unless WITNESS is defined and we
can determine with certainty that we're not holding
any locks when we're M_WAITOK.
- I've seen at least one weird socketbuffer empty-but-
mbuf-still-attached panic. I don't believe this
to be related to mbuma but please keep your eyes
open, turn on debugging, and capture crash dumps.
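
The temporary M_WAITOK workaround from the network-locking item above can be read as a small decision rule. The following is a sketch only: the flag values and the function name are made up for the model and are not the kernel's definitions, and the two booleans stand in for "WITNESS is compiled in" and "WITNESS reports that locks are held".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this model, not the kernel's M_* values. */
#define MODEL_NOWAIT 0x1
#define MODEL_WAITOK 0x2

/*
 * Decide which wait flag an allocation actually runs with.
 * A sleeping allocation is honoured only when WITNESS is available
 * and can confirm that no locks are held; otherwise it is downgraded.
 */
int model_effective_flags(int flags, bool witness_enabled, bool holding_locks)
{
    assert((flags & (MODEL_NOWAIT | MODEL_WAITOK)) != 0);

    if ((flags & MODEL_WAITOK) == 0)
        return flags;               /* already non-blocking */
    if (witness_enabled && !holding_locks)
        return flags;               /* provably safe to sleep */
    return (flags & ~MODEL_WAITOK) | MODEL_NOWAIT;
}
```

The consequence for callers is that a request made with the wait-OK flag can still return NULL while the workaround is in place, so the result has to be checked.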
This change removes more code than it adds.
A paper is available detailing the change and considering
various performance issues; it was presented at BSDCan 2004:
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Please read the paper for Future Work and implementation
details, as well as credits.
Testing and Debugging:
rwatson,
brueffer,
Ketrien I. Saihr-Kesenchedra,
...
Reviewed by: Lots of people (for different parts)
SHOULD YOU HAVE ANY ISSUES:
- Turn on INVARIANTS
- Turn on WITNESS
- Send stack trace and if possible capture crash dump
- Might require further information from you, please provide
reachable Email address.
- When you Email me, please include "MBUMA" in the Subject
line.
Cheers,
Bosko
On Mon, May 31, 2004 at 02:51:01PM -0700, Bosko Milekic wrote:
B> In order to avoid having to type everything again, I'll refer
B> to the commit log. PLEASE READ IT IN FULL:
B>
B> Bring in mbuma to replace mballoc.
B>
B> mbuma is an Mbuf & Cluster allocator built on top of a number of
B> extensions to the UMA framework, all included herein.
Have you done any performance tests? How does this new allocator affect
network performance?
How stable is it? :) Yesterday I was planning to upgrade CURRENT on
my production router. Should I do it?
--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
> How stable is it? :) Yesterday I was planning to upgrade CURRENT on my
> production router. Should I do it?
If it were me, I'd wait a couple of days for it to settle before putting
it on a production box. However, Bosko would probably love it if you
tried it on the router just to get the coverage. Definitely keep a backup
kernel on-hand though.
Robert N M Watson FreeBSD Core Team, TrustedBSD Projects
rob...@fledge.watson.org Senior Research Scientist, McAfee Research
I went ahead and upgraded as soon as I saw the announcement, mainly just
for the new capability of using an unlimited mbufs setting.
It seems to be working just fine. I've experienced no problems with it
whatsoever so far. No negative impact on network performance.
Nice work there, Bosko!
(now, if we could just get the ACPI-related hangs at shutdown fixed) :-)
--
Conrad Sabatier <con...@cox.net> - "In Unix veritas"
Yes. Please consult the paper.
http://www.unixdaemons.com/~bmilekic/netbuf_bmilekic.pdf
Better setups, notably better hardware, are welcome.
>How stable is it? :) Yesterday I was planning to upgrade CURRENT on
>my production router. Should I do it?
A) -CURRENT is doing a lot better these days and in order to support
its development and improvement further you are encouraged to put
it in loaded environments, as well as provide valuable
debugging information.
B) Your idea of putting -CURRENT on your production router yesterday
made me run to the shed and cover my eyes and ears in fear and
agony of the bleeding-edge.
If you wish to put -CURRENT into production, please go to page 57
(option A), otherwise page 2342 (option B).
Cheers,
Bosko.
On Wed, 2 Jun 2004 07:01, Conrad Sabatier wrote:
> (now, if we could just get the ACPI-related hangs at shutdown fixed) :-)
Try going to v1.5 of sys/i386/i386/intr_machdep.c
Worked for me :)
--
Daniel O'Connor software and network engineer
for Genesis Software - http://www.gsoft.com.au
"The nice thing about standards is that there
are so many of them to choose from."
-- Andrew Tanenbaum
GPG Fingerprint - 5596 B766 97C0 0E94 4347 295E E593 DC20 7B3F CE8C
[snip]
> - Giant leak in NFS code sometimes occurs, can't
> reproduce but currently analyzing; brueffer is
> able to reproduce but THIS IS NOT an mbuma-specific
> problem and currently occurs even WITHOUT mbuma.
Just a "me too" on this one.
I'm running two NFS servers on 5.2.1-RELEASE (with the old mbuf code
obviously), and I see the problem with leakage here. I don't know a way
to reproduce it yet, but it happens over time. This happens on both a
UP and an SMP machine.
After more than a month of uptime, the mbuf map starts to get eaten up.
Is there any way to look at the mbuf cache while the system is running
to see what is in there?
Example:
host1: up 76 days
mbuf usage:
GEN cache: 1962/4256 (in use/in pool)
CPU #0 cache: 22742/23040 (in use/in pool)
Total: 24704/27296 (in use/in pool)
Mbuf cache high watermark: 512
Maximum possible: 51200
Allocated mbuf types:
24670 mbufs allocated to data
31 mbufs allocated to packet headers
3 mbufs allocated to socket names and addresses
53% of mbuf map consumed
host2: up 23 days
mbuf usage:
GEN cache: 1/768 (in use/in pool)
CPU #0 cache: 254/608 (in use/in pool)
CPU #1 cache: 0/512 (in use/in pool)
CPU #2 cache: 0/512 (in use/in pool)
CPU #3 cache: 3/512 (in use/in pool)
Total: 258/2912 (in use/in pool)
Mbuf cache high watermark: 512
Maximum possible: 51200
Allocated mbuf types:
258 mbufs allocated to data
5% of mbuf map consumed
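
For reading the output above: the "mbuf map consumed" figures are consistent with the pool total divided by "Maximum possible", truncated to an integer. This is an inference from the numbers shown, not a statement about how netstat computes them; the function name is hypothetical.

```c
#include <assert.h>

/*
 * Percentage of the mbuf map in use, assuming it is the number of
 * mbufs in the pool over the maximum possible, with integer truncation.
 */
unsigned long map_percent_consumed(unsigned long in_pool,
                                   unsigned long max_possible)
{
    assert(max_possible > 0);
    return in_pool * 100UL / max_possible;
}
```

With host1's values (27296 in pool, 51200 maximum) this gives 53, and with host2's (2912 in pool) it gives 5, matching the two reports.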
Cheers,
Frode Nordahl
On Mon, May 31, 2004 at 02:51:01PM -0700, Bosko Milekic wrote:
B> mbuma is an Mbuf & Cluster allocator built on top of a number of
B> extensions to the UMA framework, all included herein.
Are you going to convert the mbuf tag allocator to UMA? Currently,
tags are allocated with malloc(). AFAIK, tags are used heavily in pf
and the forthcoming ALTQ. Moving to UMA should affect their performance
positively.
--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
[deletia]
> are you going to convert mbuf tag allocator to UMA? Now
>tags are allocated with malloc(). AFAIK, tags are used heavily in pf,
>and forthcoming ALTQ. Moving to UMA should affect their performance
>positively.
First off, malloc() *is* UMA. With mbuma in the tree, I don't believe
we have any remaining custom-allocators in the tree.
As for what to do with m_tags, it is still unclear to me. Personally,
I'm conflicted about their use. On one hand, they offer a clean way
to attach metadata to packets, but on the other hand they are quite
expensive.
If you read the paper on mbuma, you'll notice that I point out that it
would be worth investigating whether, in scenarios where an m_tag is
ALWAYS required per packet (e.g., MAC), providing a secondary zone with
pre-allocated m_tags for packet headers might be worth it. Prior to
this work, however, I suggest we investigate the possibility of using
smaller mini-mbufs whenever clusters are used so that space wastage
is reduced.
-Bosko
I'm afraid you misunderstood what I meant by "Giant leak." :-)
I was referring to a potential leak of the Giant _lock_, not a leak
of Mbufs and Clusters. But what you are seeing seems to very much be
an Mbuf leak, at least, and should be investigated.
Please cvsup to -CURRENT (follow the -current mailing list to figure
out when it should be reasonably safe to do so) and verify whether this
problem has been fixed, as it is possible that it already has.
If not, collect as much data as possible and get back to me.
*whoops* :-)
> Please cvsup to -CURRENT (follow the -current mailing list to figure
> out when it should be reasonably safe to do so) and verify whether this
> problem has been fixed, as it is possible that it already has.
Not sure if I dare do this on the production server, but I will set up
a test system to see if I can reproduce and test against -CURRENT.
Thanks for your response
Cheers,
Frode
You probably meant you wanted to use a UMA zone. m_tag's can already be
allocated using this mechanism. I did it once for vlan tags but botched it
(didn't handle module references properly) so I backed it out. But there's no
reason someone cannot redo it or convert other heavily used fixed size tags
to use a zone.
Sam
Exactly.
What about using its own UMA zone for each m_tag consumer: pf, ALTQ, divert,
vlan? Each module allocates its zone on startup, and later a reference to
this zone is passed to m_tag_alloc().
S> allocated using this mechanism. I did it once for vlan tags but botched it
S> (didn't handle module references properly) so backed it. But there's no
S> reason someone cannot redo it or convert other heavily used fixed size tags
S> to use a zone.
Have you saved your efforts? May I look at them?
--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
They are in the CVS history of sys/net/if_vlan.c.
-- Brooks
--
Any statement of the form "X is the one, true Y" is FALSE.
PGP fingerprint 655D 519C 26A7 82E7 2529 9BF0 5D8E 8BE9 F238 1AD4
I believe only Poul-Henning was actually suggesting something along
these lines at the end of the talk. As I explained then, this is not
a good approach.
First of all, it pessimizes the send case. You no longer have 2K of
space for payload, you have 2K - whatever the mbuf struct is.
Secondly, even on the receiver side, it is probably not worth the
complication, especially with mbuma now. You should really both
read the paper and read the code before you go ahead and toy around
with an idea like this. You will notice that when we need both a
cluster and an mbuf, there is no longer a double-allocation in the
common case. It is a single object allocation as we maintain a
secondary zone which caches mbufs with pre-allocated clusters.
That means you're looking at one pcpu UMA lock in the common case
right now and, if I have my way eventually, no locks whatsoever
in the common UMA allocation path.
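
The single-object allocation described above can be sketched with a userspace model. Everything here is an assumption made for illustration: the model_* names are hypothetical and only loosely mirror m_get(), m_clget() and m_getcl(), malloc(3) stands in for the primary zones, and there is no per-CPU caching or locking. The counter records how many trips to the backing allocator each path takes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_CLBYTES 2048              /* cluster size used in this sketch */

struct model_mbuf {
    struct model_mbuf *next_free;       /* link while sitting in the cache */
    char *ext_buf;                      /* attached cluster, or NULL */
};

unsigned backend_allocs;                /* trips to the backing allocator */
struct model_mbuf *packet_cache;        /* mbufs with a cluster still attached */

/* Allocate a bare mbuf. */
struct model_mbuf *model_get(void)
{
    struct model_mbuf *m = calloc(1, sizeof(*m));

    if (m != NULL)
        backend_allocs++;
    return m;
}

/* Attach a cluster to an existing mbuf; returns 0 on success. */
int model_clget(struct model_mbuf *m)
{
    m->ext_buf = malloc(MODEL_CLBYTES);
    if (m->ext_buf == NULL)
        return -1;
    backend_allocs++;
    return 0;
}

/* Get an mbuf with a cluster, preferring a cached pre-built pair. */
struct model_mbuf *model_getcl(void)
{
    struct model_mbuf *m = packet_cache;

    if (m != NULL) {                    /* common case: one cached object */
        packet_cache = m->next_free;
        m->next_free = NULL;
        return m;
    }
    m = model_get();                    /* cache miss: build the pair */
    if (m == NULL)
        return NULL;
    if (model_clget(m) != 0) {
        free(m);
        return NULL;
    }
    return m;
}

/* Free a pair by caching it whole, cluster still attached. */
void model_freecl(struct model_mbuf *m)
{
    assert(m->ext_buf != NULL);
    m->next_free = packet_cache;
    packet_cache = m;
}
```

In this model the first model_getcl() costs two backing allocations, while a model_getcl() after a model_freecl() costs none: the pair comes back as one cached object.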
By increasing the mbuf size, you will create a third type of object
called a "large mbuf." You'll still need clusters for the sender
side, and regular mbufs for smaller packets (there are many of these,
refer to the paper). So in actual fact, you're not fixing anything,
you're just introducing another object type that the mbuf code now
needs to identify and free appropriately from the free routine.
How about instead of this, you look into creating mini-mbufs,
which are sort of like regular mbufs, but without the internal data
region, and which are ONLY used for external storage. They work
for all types of external storage, waste less space, and can be
cached within a UMA zone and thus allocated as effectively a single
object in the common case; this is exactly what happens now already
with m_getcl(), except that there is some additional wastage due to
the internal mbuf data region not being used.
>Since the double allocation required to create a cluster makes the locking
>(and cache slushing) requirements go up, it is probably worthwhile to
>investigate if raising the nominal mbuf size doesn't end up decreasing
>overall memory pressure. If you allocate more memory, but the allocation
>takes less time due to the simpler locking, you may actually decrease the
>total memory need.
No. See above.
>This is worth investigating partly because it is such a simple change. I
>propose investigating with mbuf size of 2K, large enough to fit standard
>ethernet frames, and a cluster size of 8K, which means a cluster mbuf is
>large enough to hold a 9K jumbo frame.
LOL.
>Now that you've got mbuma in the tree, I can test this for you, unless this
>proposal catches your interest enough to give it a try. I'll see if I
>can't get a couple of our beefier machines at work updated to -CURRENT in
>the next week.
>
>Thanks for the good work.
Sure. You can test whatever you like.
-Bosko
It looks like your host2 has a normal level of mbuf usage. Is there a
difference in the applications running on these hosts?
I'm using net-snmp and mrtg to collect statistics on mbuf and mbuf
cluster usage. If you can find any trends (an increase during a specific
time period, for example), it would help further debugging.
---- mrtg.cfg
Target[mbuf]:hrStorageUsed.1&hrStorageUsed.4:public@server
MaxBytes1[mbuf]: 40960
MaxBytes2[mbuf]: 20480
Options[mbuf]: gauge,nopercent,growright
Title[mbuf]: Number of mbufs in use
YLegend[mbuf]: mbufs
ShortLegend[mbuf]:
LegendI[mbuf]: mbufs in use
LegendO[mbuf]: mbuf clusters in use
PageTop[mbuf]: <H1>Number of mbufs in use</H1>
--
Jun Kuriyama <kuri...@imgsrc.co.jp> // IMG SRC, Inc.
<kuri...@FreeBSD.org> // FreeBSD Project
I see now, thanks.
Question to Sam: have you performed any tests? Is it definitely true
that allocating from a dedicated UMA zone is faster than general malloc()?
--
Totus tuus, Glebius.
GLEBIUS-RIPN GLEB-RIPE
It may also be worthwhile investigating eliminating clusters entirely. This
is the point Poul-Henning, Robert and I were trying to make at the end of
your talk at BSDCan.
Since the double allocation required to create a cluster makes the locking
(and cache slushing) requirements go up, it is probably worthwhile to
investigate if raising the nominal mbuf size doesn't end up decreasing
overall memory pressure. If you allocate more memory, but the allocation
takes less time due to the simpler locking, you may actually decrease the
total memory need.
This is worth investigating partly because it is such a simple change. I
propose investigating with mbuf size of 2K, large enough to fit standard
ethernet frames, and a cluster size of 8K, which means a cluster mbuf is
large enough to hold a 9K jumbo frame.
Now that you've got mbuma in the tree, I can test this for you, unless this
proposal catches your interest enough to give it a try. I'll see if I
can't get a couple of our beefier machines at work updated to -CURRENT in
the next week.
Thanks for the good work.
--
Where am I, and what am I doing in this handbasket?
Wes Peters w...@softweyr.com
Allocating from a zone made a noticeable difference for gige interfaces,
especially on my SMP configuration (which was running w/o Giant). For
non-gige interfaces the overhead of using malloc is not noticeable (as I
reported when I first converted vlan handling over to use tags). Regardless,
the point was that you can already use a zone for tags if you want.
Sam