Optimum size for malloc() allocation?

1,261 views
Skip to first unread message

Jeffrey Kegler

unread,
Jan 5, 2014, 12:49:00 AM1/5/14
to marpa-...@googlegroups.com
This is a bit of a long story, but it's a question which I've tried to research without must progress, and perhaps someone on the list can help.  I've finished up my rework of Marpa's memory allocator.  It's based on GNU's obstacks, but very heavily hacked at this point.  It's extremely efficient, but efficiency is *not* its most important advantage.  Marpa has lots of complex data structures which can be very hard to destroy, particularly under error conditions -- they may still be under construction.  If you look at Ben Pfaff's AVL code, his most complex logic is not for the AVL data structures themselves, but the code that makes sure his AVL's don't leak memory.

Obstacks eliminate my having to worry about this.  Each object (recognizer, grammar, etc.) has an obstack and almost everything is allocated from that obstack.  When the object is destroyed, there's no need to follow all the links, make sure everything is destroyed in the right order, etc., etc. -- I just delete the obstack in which everything was allocated.

Marpa's obstacks have the disadvantage that I cannot resize allocations.  I also cannot free any of the memory without destroying the entire obstack.  In Marpa, I rarely need to do either of these things.  For the few cases where obstacks fall short of what I need, I just fall back on malloc().

Marpa's obstacks get their memory in big blocks from malloc().  For obstacks, I have a good deal of freedom in deciding how much memory to request.  My question is, what is the best size?

I am currently followig the strategy used by GNU's obstacks.  They allocated, by default, 4K less a complex calculation based on the size of various malloc() headers, etc.  Their idea was that if they ask for exactly 4K, GNU's malloc(), because of the extra few bytes needed for headers, would need to allocate an extra block.  That suggests that doing many allocations, all of exactly 4K, is an almost pessimal way to use malloc().  But the GNU obstack authors append a comment noting that they think their analysis, as of the latest GNU malloc, may be obsolete -- that lots of allocations of exactly 4K are no longer worse than any other strategy.

Most helpful for me would be pointers to the writeups where this issue is addressed.  I've tried to research this, but the papers that I find focus on how well they deal with difficult mixes of allocation sizes, and not on how an application with a lot of flexibility can help malloc() out.

Thanks, jeffrey

Ruslan Shvedov

unread,
Jan 5, 2014, 5:59:34 AM1/5/14
to marpa-...@googlegroups.com
From what the obstacks authors say it looks like if range checking is off [1] you can still use 4K as the best size (16 bytes extra looks like a safe bet otherwise [2]) and "less sensitive to the size of the request" as applies to "the new" GNU C malloc seems to mean that it does not align block sizes to powers of 2 [4] (as obstack doc says [3]) and just uses 4K fixed block size [5], well, most of the time [6].

Overall, the absolutely best way for an application to help malloc() seems to be determine the pagesize [7] and align to it, e.g. by setting obstack default chunk size to it. Otherwise, 4K would well be the optimal size. 

Not exactly what's asked, but hopefully helpful.

References:

[1] "12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
Use the values for range checking, because if range checking is off,
the extra bytes won't be missed terribly, but if range checking is on
and we used a larger request, a whole extra 4096 bytes would be
allocated."
"These number are irrelevant to the new GNU malloc. I suspect it is
less sensitive to the size of the request."

[2] Allocated memory contains an 8 or 16 byte overhead for the size of the chunk and usage flags.
— http://en.wikipedia.org/wiki/C_dynamic_memory_allocation#dlmalloc

[3] If you allocate chunks with malloc, the chunk size should be a power of 2.
http://gcc.gnu.org/onlinedocs/libiberty/Obstack-Chunks.html

[4] the malloc in the GNU C Library does not round up block sizes to powers of two, neither for large nor for small sizes
http://www.gnu.org/software/libc/manual/html_node/Efficiency-and-Malloc.html#Efficiency-and-Malloc

[5] The new GNU malloc improves on things by allocating large objects in chunks of 4096 bytes rather than in ever larger powers of two, which results in ever larger wastage.

[6] To the extent possible, this malloc manages memory from the system in page-size units.

#    define malloc_getpagesize (4096) /* just guess */

malloc_getpagesize   (default: derived from system #includes)


[7] 
Most operating systems allow programs to discover the page size at runtime. This allows programs to use memory more efficiently by aligning allocations to this size and reducing overall internal fragmentation of pages.
http://en.wikipedia.org/wiki/Page_size#Determining_the_page_size_in_a_program

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Jeffrey Kegler

unread,
Jan 5, 2014, 10:12:48 AM1/5/14
to marpa-...@googlegroups.com
Thanks for gathering these references, which I read or re-read.� Reading them and thinking, I was reminded to have another look at Linus Torvald's git, which has its own very interesting memory allocator, which Linus calls "slabs".� Linus's allocator gets its memory from malloc() in big slabs, which he sizes as follows:
/* allocate ~512kB at once, allowing for malloc overhead */
#ifndef COMMIT_SLAB_SIZE
#define COMMIT_SLAB_SIZE (512*1024-32)
#endif

So Linus's answer seems to be allocate a good-sized power of two, less 32 btyes to allow for overhead.� It sounds reasonable, and Linus tends to know on these matters.

So, in the Marpa case, 4096-32 is probably the best choice.� (I was tempted to make it 4096-42, but I suspect Linus picked a power of two because it is most efficient in a buddy-system allocator.)

-- jeffrey

On 01/05/2014 02:59 AM, Ruslan Shvedov wrote:
From what the obstacks authors say it looks like if�range checking is off�[1] you can still use 4K as the best size (16 bytes extra looks like a safe bet�otherwise�[2]) and "less sensitive to the size of the request" as applies to "the new" GNU C malloc seems to mean that it does not align block sizes to powers of 2 [4]�(as obstack doc says�[3]) and just uses 4K�fixed�block size�[5], well, most of the time�[6].

Overall, the absolutely best way for an application to help malloc() seems to be determine the pagesize�[7] and align to�it, e.g. by setting obstack default chunk size to it. Otherwise, 4K would well be the optimal size.�

Not exactly what's asked, but hopefully helpful.

References:

[1] "12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
Use the values for range checking, because if range checking is off,
the extra bytes won't be missed terribly, but if range checking is on
and we used a larger request, a whole extra 4096 bytes would be
allocated."
"These number are irrelevant to the new GNU malloc. I suspect it is
less sensitive to the size of the request."

[2] Allocated memory contains an 8 or 16 byte overhead for the size of the chunk and usage flags.

[3] If you allocate chunks with malloc, the chunk size should be a power of 2.

[4] the malloc in the GNU C Library�does not round up block sizes to powers of two, neither for large nor for small sizes
� http://www.gnu.org/software/libc/manual/html_node/Efficiency-and-Malloc.html#Efficiency-and-Malloc

[5] The new GNU malloc improves on things by allocating large objects in chunks of 4096 bytes�rather than in ever larger powers of two, which results in ever larger wastage.

[6]�To the extent possible, this malloc manages memory from the system in page-size units.

# � �define malloc_getpagesize (4096) /* just guess */

malloc_getpagesize � (default: derived from system #includes)


[7]�
Most operating systems allow programs to discover the page size at runtime. This allows programs to use memory more efficiently by aligning allocations to this size and reducing overall internal fragmentation of pages.

On Sun, Jan 5, 2014 at 7:49 AM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:
This is a bit of a long story, but it's a question which I've tried to research without must progress, and perhaps someone on the list can help.� I've finished up my rework of Marpa's memory allocator.� It's based on GNU's obstacks, but very heavily hacked at this point.� It's extremely efficient, but efficiency is *not* its most important advantage.� Marpa has lots of complex data structures which can be very hard to destroy, particularly under error conditions -- they may still be under construction.� If you look at Ben Pfaff's AVL code, his most complex logic is not for the AVL data structures themselves, but the code that makes sure his AVL's don't leak memory.

Obstacks eliminate my having to worry about this.� Each object (recognizer, grammar, etc.) has an obstack and almost everything is allocated from that obstack.� When the object is destroyed, there's no need to follow all the links, make sure everything is destroyed in the right order, etc., etc. -- I just delete the obstack in which everything was allocated.

Marpa's obstacks have the disadvantage that I cannot resize allocations.� I also cannot free any of the memory without destroying the entire obstack.� In Marpa, I rarely need to do either of these things.� For the few cases where obstacks fall short of what I need, I just fall back on malloc().

Marpa's obstacks get their memory in big blocks from malloc().� For obstacks, I have a good deal of freedom in deciding how much memory to request.� My question is, what is the best size?

I am currently followig the strategy used by GNU's obstacks.� They allocated, by default, 4K less a complex calculation based on the size of various malloc() headers, etc.� Their idea was that if they ask for exactly 4K, GNU's malloc(), because of the extra few bytes needed for headers, would need to allocate an extra block.� That suggests that doing many allocations, all of exactly 4K, is an almost pessimal way to use malloc().� But the GNU obstack authors append a comment noting that they think their analysis, as of the latest GNU malloc, may be obsolete -- that lots of allocations of exactly 4K are no longer worse than any other strategy.

Most helpful for me would be pointers to the writeups where this issue is addressed.� I've tried to research this, but the papers that I find focus on how well they deal with difficult mixes of allocation sizes, and not on how an application with a lot of flexibility can help malloc() out.

Thanks, jeffrey
--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ruslan Zakirov

unread,
Jan 6, 2014, 5:16:48 PM1/6/14
to marpa-parser
Hi,

Take a look at malloc_good_size, malloc_size and malloc_usable_size which are available on some systems. At least you can get answer about what happens on your system.

I checked once on MacOS X and as far as I recall the largest padding was 32 bytes. I can be wrong as it was several years ago, but for sure it was not even close to 4k.


On Sun, Jan 5, 2014 at 7:12 PM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:
Thanks for gathering these references, which I read or re-read.  Reading them and thinking, I was reminded to have another look at Linus Torvald's git, which has its own very interesting memory allocator, which Linus calls "slabs".  Linus's allocator gets its memory from malloc() in big slabs, which he sizes as follows:

/* allocate ~512kB at once, allowing for malloc overhead */
#ifndef COMMIT_SLAB_SIZE
#define COMMIT_SLAB_SIZE (512*1024-32)
#endif

So Linus's answer seems to be allocate a good-sized power of two, less 32 btyes to allow for overhead.  It sounds reasonable, and Linus tends to know on these matters.

So, in the Marpa case, 4096-32 is probably the best choice.  (I was tempted to make it 4096-42, but I suspect Linus picked a power of two because it is most efficient in a buddy-system allocator.)

-- jeffrey

On 01/05/2014 02:59 AM, Ruslan Shvedov wrote:
From what the obstacks authors say it looks like if range checking is off [1] you can still use 4K as the best size (16 bytes extra looks like a safe bet otherwise [2]) and "less sensitive to the size of the request" as applies to "the new" GNU C malloc seems to mean that it does not align block sizes to powers of 2 [4] (as obstack doc says [3]) and just uses 4K fixed block size [5], well, most of the time [6].

Overall, the absolutely best way for an application to help malloc() seems to be determine the pagesize [7] and align to it, e.g. by setting obstack default chunk size to it. Otherwise, 4K would well be the optimal size. 

Not exactly what's asked, but hopefully helpful.

References:

[1] "12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
Use the values for range checking, because if range checking is off,
the extra bytes won't be missed terribly, but if range checking is on
and we used a larger request, a whole extra 4096 bytes would be
allocated."
"These number are irrelevant to the new GNU malloc. I suspect it is
less sensitive to the size of the request."

[2] Allocated memory contains an 8 or 16 byte overhead for the size of the chunk and usage flags.

[3] If you allocate chunks with malloc, the chunk size should be a power of 2.

[4] the malloc in the GNU C Library does not round up block sizes to powers of two, neither for large nor for small sizes
http://www.gnu.org/software/libc/manual/html_node/Efficiency-and-Malloc.html#Efficiency-and-Malloc

[5] The new GNU malloc improves on things by allocating large objects in chunks of 4096 bytes rather than in ever larger powers of two, which results in ever larger wastage.

[6] To the extent possible, this malloc manages memory from the system in page-size units.

#    define malloc_getpagesize (4096) /* just guess */

malloc_getpagesize   (default: derived from system #includes)


[7] 
Most operating systems allow programs to discover the page size at runtime. This allows programs to use memory more efficiently by aligning allocations to this size and reducing overall internal fragmentation of pages.

On Sun, Jan 5, 2014 at 7:49 AM, Jeffrey Kegler <jeffre...@jeffreykegler.com> wrote:
This is a bit of a long story, but it's a question which I've tried to research without must progress, and perhaps someone on the list can help.  I've finished up my rework of Marpa's memory allocator.  It's based on GNU's obstacks, but very heavily hacked at this point.  It's extremely efficient, but efficiency is *not* its most important advantage.  Marpa has lots of complex data structures which can be very hard to destroy, particularly under error conditions -- they may still be under construction.  If you look at Ben Pfaff's AVL code, his most complex logic is not for the AVL data structures themselves, but the code that makes sure his AVL's don't leak memory.

Obstacks eliminate my having to worry about this.  Each object (recognizer, grammar, etc.) has an obstack and almost everything is allocated from that obstack.  When the object is destroyed, there's no need to follow all the links, make sure everything is destroyed in the right order, etc., etc. -- I just delete the obstack in which everything was allocated.

Marpa's obstacks have the disadvantage that I cannot resize allocations.  I also cannot free any of the memory without destroying the entire obstack.  In Marpa, I rarely need to do either of these things.  For the few cases where obstacks fall short of what I need, I just fall back on malloc().

Marpa's obstacks get their memory in big blocks from malloc().  For obstacks, I have a good deal of freedom in deciding how much memory to request.  My question is, what is the best size?

I am currently followig the strategy used by GNU's obstacks.  They allocated, by default, 4K less a complex calculation based on the size of various malloc() headers, etc.  Their idea was that if they ask for exactly 4K, GNU's malloc(), because of the extra few bytes needed for headers, would need to allocate an extra block.  That suggests that doing many allocations, all of exactly 4K, is an almost pessimal way to use malloc().  But the GNU obstack authors append a comment noting that they think their analysis, as of the latest GNU malloc, may be obsolete -- that lots of allocations of exactly 4K are no longer worse than any other strategy.

Most helpful for me would be pointers to the writeups where this issue is addressed.  I've tried to research this, but the papers that I find focus on how well they deal with difficult mixes of allocation sizes, and not on how an application with a lot of flexibility can help malloc() out.

Thanks, jeffrey
--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Best regards, Ruslan.

Reini Urban

unread,
Jan 6, 2014, 5:25:39 PM1/6/14
to marpa-...@googlegroups.com
It depends which malloc your glibc is using. Mostly ptmalloc2 on linux.

Some bsd's have the better ptmalloc3 already, which can use obstacks
via _independent_comalloc() and avoid the crazy malloc overhead by
using 4k obstacks manually already.
On the perl-compiler I try to detect ptmalloc3 at installation time
(like LD_PRELOAD) and use the better _independent_comalloc() then.

With potion I mmap pages directly in 4K chunks, because I rather
handle memory manually than leaving it to malloc/free.
Reini Urban
http://cpanel.net/ http://www.perl-compiler.org/

Jeffrey Kegler

unread,
Jan 6, 2014, 7:32:01 PM1/6/14
to marpa-...@googlegroups.com
@Reini: I tried to find the potion code on your Github site but it didn't jump out at me.� Could you point me at you mmap() code?

Using mmap() seems like getting a bit too too close to the metal, but by the time I am devoting a lot of effort to second-quessing malloc() on its header sizes, it may make sense.� After all, if what I really want is a page, why not just ask for a page?

-- jeffrey

Thanks for gathering these references, which I read or re-read.  Reading
them and thinking, I was reminded to have another look at Linus Torvald's
git, which has its own very interesting memory allocator, which Linus calls
"slabs".  Linus's allocator gets its memory from malloc() in big slabs,
which he sizes as follows:

/* allocate ~512kB at once, allowing for malloc overhead */
#ifndef COMMIT_SLAB_SIZE
#define COMMIT_SLAB_SIZE (512*1024-32)
#endif


So Linus's answer seems to be allocate a good-sized power of two, less 32
btyes to allow for overhead.  It sounds reasonable, and Linus tends to know
on these matters.

So, in the Marpa case, 4096-32 is probably the best choice.  (I was
tempted to make it 4096-42, but I suspect Linus picked a power of two
because it is most efficient in a buddy-system allocator.)

-- jeffrey

On 01/05/2014 02:59 AM, Ruslan Shvedov wrote:

From what the obstacks authors say it looks like if range checking is off
[1] you can still use 4K as the best size (16 bytes extra looks like a safe
bet otherwise [2]) and "less sensitive to the size of the request" as
applies to "the new" GNU C malloc seems to mean that it does not align block
sizes to powers of 2 [4] (as obstack doc says [3]) and just uses 4K fixed
block size [5], well, most of the time [6].

Overall, the absolutely best way for an application to help malloc() seems
to be determine the pagesize [7] and align to it, e.g. by setting obstack
default chunk size to it. Otherwise, 4K would well be the optimal size.

Not exactly what's asked, but hopefully helpful.

References:

[1] "12 is sizeof (mhead) and 4 is EXTRA from GNU malloc.
Use the values for range checking, because if range checking is off,
the extra bytes won't be missed terribly, but if range checking is on
and we used a larger request, a whole extra 4096 bytes would be
allocated."
"These number are irrelevant to the new GNU malloc. I suspect it is
less sensitive to the size of the request."
�
https://github.com/jeffreykegler/Marpa--R2/blob/master/cpan/libmarpa/obs/marpa_obs.c

[2] Allocated memory contains an 8 or 16 byte overhead for the size of the
chunk and usage flags.
� http://en.wikipedia.org/wiki/C_dynamic_memory_allocation#dlmalloc

[3] If you allocate chunks with malloc, the chunk size should be a power
of 2.
� http://gcc.gnu.org/onlinedocs/libiberty/Obstack-Chunks.html

[4] the malloc in the GNU C Library does not round up block sizes to
powers of two, neither for large nor for small sizes
�
http://www.gnu.org/software/libc/manual/html_node/Efficiency-and-Malloc.html#Efficiency-and-Malloc

[5] The new GNU malloc improves on things by allocating large objects in
chunks of 4096 bytes rather than in ever larger powers of two, which results
in ever larger wastage.
� http://www.sxemacs.org/docs/internals/Low_002dlevel-allocation.html

[6] To the extent possible, this malloc manages memory from the system in
page-size units.

#    define malloc_getpagesize (4096) /* just guess */

malloc_getpagesize   (default: derived from system #includes)

�
http://web.mit.edu/course/13/13.715/build/lam-7.1.3/share/memory/ptmalloc/ptmalloc.c

[7]
Most operating systems allow programs to discover the page size at
runtime. This allows programs to use memory more efficiently by aligning
allocations to this size and reducing overall internal fragmentation of
pages.
�

Reini Urban

unread,
Jan 7, 2014, 12:51:19 PM1/7/14
to marpa-...@googlegroups.com
On Mon, Jan 6, 2014 at 6:32 PM, Jeffrey Kegler
<jeffre...@jeffreykegler.com> wrote:
> @Reini: I tried to find the potion code on your Github site but it didn't
> jump out at me. Could you point me at you mmap() code?

https://github.com/perl11/potion/blob/master/core/gc.c#L114
but it is tightly related to the GC, and you need a win32-support function:
https://github.com/perl11/potion/blob/master/core/contrib.c#L47

> Using mmap() seems like getting a bit too too close to the metal, but by the
> time I am devoting a lot of effort to second-quessing malloc() on its header
> sizes, it may make sense. After all, if what I really want is a page, why
> not just ask for a page?

ptmalloc3: http://www.malloc.de/en/

independent_comalloc
doc: https://github.com/rurban/perl-compiler/blob/release/lib/B/C.pm#L6416
and see the checker in
https://github.com/rurban/perl-compiler/blob/release/Makefile.PL#L189
> —
> https://github.com/jeffreykegler/Marpa--R2/blob/master/cpan/libmarpa/obs/marpa_obs.c
>
> [2] Allocated memory contains an 8 or 16 byte overhead for the size of the
> chunk and usage flags.
> — http://en.wikipedia.org/wiki/C_dynamic_memory_allocation#dlmalloc
>
> [3] If you allocate chunks with malloc, the chunk size should be a power
> of 2.
> — http://gcc.gnu.org/onlinedocs/libiberty/Obstack-Chunks.html
>
> [4] the malloc in the GNU C Library does not round up block sizes to
> powers of two, neither for large nor for small sizes
> —
> http://www.gnu.org/software/libc/manual/html_node/Efficiency-and-Malloc.html#Efficiency-and-Malloc
>
> [5] The new GNU malloc improves on things by allocating large objects in
> chunks of 4096 bytes rather than in ever larger powers of two, which results
> in ever larger wastage.
> — http://www.sxemacs.org/docs/internals/Low_002dlevel-allocation.html
>
> [6] To the extent possible, this malloc manages memory from the system in
> page-size units.
>
> # define malloc_getpagesize (4096) /* just guess */
>
> malloc_getpagesize (default: derived from system #includes)
>
> —
> http://web.mit.edu/course/13/13.715/build/lam-7.1.3/share/memory/ptmalloc/ptmalloc.c
>
> [7]
> Most operating systems allow programs to discover the page size at
> runtime. This allows programs to use memory more efficiently by aligning
> allocations to this size and reducing overall internal fragmentation of
> pages.
> —
Reply all
Reply to author
Forward
0 new messages