
Patch to make string-append on win32 100 times faster


Wolfram Humann

Jul 30, 2010, 9:37:39 AM
to perl5-...@perl.org
[sent first without being subscribed; didn't see the message appearing in
any of the archives; subscribed; resending now; hope I don't double-post...]

My first post to p5p -- let me know if I should have gone about this
differently. This discussion started on c.l.p.m., subject "Speed of
reading some MB of data using qx(...)", see
http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2.
I'll try to summarize my current understanding of matters:

When a string grows (e.g. gets appended to), perl calls sv_grow. When
sv_grow decides that the memory currently allocated to the string is
insufficient, it calls saferealloc. Depending on whether or not perl was
compiled with 'usemymalloc', this calls realloc either in perl's internal
version or in the operating system's. Perl requests from realloc just the
amount of memory required for the current operation. With 'usemymalloc'
this is not a problem because it rounds up memory allocation to a
certain geometric progression anyway. When the operating system's
realloc is called, this may or may not lead to disaster, depending on
how it's implemented. On win32 it does lead to disaster: when I loop
1000 times and each time append 1000 chars to an initial string of
10 million chars, the memory grows from 10.000e6 to 10.001e6 to 10.002e6
and so on, 1000 times, till it ends at 11.000e6.

I would like to propose that perl requests an additional amount of
memory from realloc when strings grow by a small amount. After
discussion on c.l.p.m. and with the help of ideas from Ilya Zakharevich
and Peter J. Holzer my proposal reads like this:
(I'm not a regular user of diff and using it wrong might do more harm
than good)

In sv.c the following section in Perl_sv_grow():

    if (newlen > SvLEN(sv)) {		/* need more room? */
#ifndef Perl_safesysmalloc_size
	newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif

should be expanded to:

#ifndef PERL_STRLEN_EXPAND_SHIFT
# define PERL_STRLEN_EXPAND_SHIFT 2
#endif

    if (newlen > SvLEN(sv)) {		/* need more room? */
	size_t minlen = SvCUR(sv);
	minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
	if (newlen < minlen) newlen = minlen;
	/* newlen += (newlen >> 1) + 10; */
#ifndef Perl_safesysmalloc_size
	newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif

I've posted a benchmark with results on c.l.p.m., so two
examples may suffice here:
a) Starting from an initial string of length 1e7 chars, append 1e6 chars
in 10000 chunks of 100 chars each.
b) Read long, multiline output from another process, simulated here by
reading a 12 MB postscript file using $ps = qx(cat postscriptfile.ps)

On current Strawberry Perl (5.10 or 5.12):
a) 2.6 seconds
b) 18 seconds

Perl 5.12 compiled on win32 with modifications as above:
a) 0.014 seconds
b) 0.2 seconds

The value of PERL_STRLEN_EXPAND_SHIFT determines the balance between
extra memory use and speed (i.e. frequency of calls to realloc). "2"
seems to me like a decent compromise.

Wolfram

David Golden

Jul 30, 2010, 10:30:27 AM
to Wolfram Humann, perl5-...@perl.org
On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann <w.c.h...@arcor.de> wrote:
> I would like to propose that perl requests an additional amount of memory
> from realloc when strings grow by a small amount.

+1 -- though I'm not clear whether this is win32-specific or whether all
operating systems should work this way.

>    if (newlen > SvLEN(sv)) {        /* need more room? */
>     size_t minlen  = SvCUR(sv);
>     minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
>     if (newlen < minlen) newlen = minlen;

[snip]

I have a few questions about the specific implementation. Please
don't take them as suggestions that your approach is necessarily wrong
-- but I'd like to understand the logic behind certain choices.

* Why are you suggesting making the extra allocation proportional to
the length of the string? Why not a multiple of the length of the
increment? Why not a fixed amount?

* Why the "+10"? Why have an extra minimum at all? Why "10" instead
of another number? Given your examples, for a large string the 10 is
a trivial fraction of the total and for a small string, the efficiency
either doesn't matter or the string quickly grows into a large string
anyway.

* Why do you think PERL_STRLEN_EXPAND_SHIFT should be 2? That's a 25%
increase in the length of the string. (In many programs, I suspect
many strings are appended to only a few times and that seems like a
large memory price to pay.)

> On current Strawberry Perl (5.10 or 5.12):
> a) 2.6 seconds
> b) 18 seconds
>
> Perl 5.12 compiled on win32 with modifications as above:
> a) 0.014 seconds
> b) 0.2 seconds

Awesome! What does it look like with a shift of 3 or 4?

-- David

Marvin Humphrey

Jul 30, 2010, 11:34:10 AM
to David Golden, Wolfram Humann, perl5-...@perl.org
On Fri, Jul 30, 2010 at 10:30:27AM -0400, David Golden wrote:
> On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann <w.c.h...@arcor.de> wrote:
> > I would like to propose that perl requests an additional amount of memory
> > from realloc when strings grow by a small amount.
>
> +1 -- though I'm not clear whether this win32 specific or whether all
> operating systems should work this way.

IMO all should.

> -- but I'd like to understand the logic behind certain choices.

We discussed this kind of algorithm at length for Lucene and Lucy, including
exploration into what Python and Ruby do.

Dynamic array reallocation algorithms
http://markmail.org/thread/x427sku4wrnc3rjf

> * Why are you suggesting making the extra allocation proportional to
> the length of the string? Why not a multiple of the length of the
> increment? Why not a fixed amount?

IMO, the overage ought to be proportional to the space required by the final
string.

> * Why the "+10"? Why have a extra minimum at all? Why "10" instead
> of another number? Given your examples, for a large string the 10 is
> a trivial fraction of the total and for a small string, the efficiency
> either doesn't matter or the string quickly grows into a large string
> anyway.

That's similar to my analysis. I think it's more useful to round up to a
multiple of the pointer size for small values.

> * Why do you think PERL_STRLEN_EXPAND_SHIFT should be 2? That's a 25%
> increase in the length of the string. (In many programs, I suspect
> many strings are appended to only a few times and that seems like
> large memory price to pay.)

We decided to go with 12.5%, similar to Python.

http://markmail.org/message/obehcimh2ly4cd3l

I agree, a smaller size is better. Say you start with an array which holds 800
elements but which has been over-allocated by one eighth (+100 elements =
900). Reallocating at 900 and again at 1000-something isn't that different
from reallocating only once at 1000. So long as you avoid reallocating at
every increment -- 801, 802, etc -- you have achieved your main goal.

Both mild and aggressive over-allocation solve that problem, but aggressive
over-allocation also involves significant RAM costs. Where the best balance
lies depends on how bad the reallocation performance is in relation to the
cost of insertion. An eighth seems pretty good. Doubling seems too high. 1.5,
dunno, seems like it could work. According to this superficially
credible-seeming comment at stackoverflow, that's what Ruby does:

http://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array/1100449#1100449

Incidentally, the Python algo was originally intended to address poor
reallocation performance on Windows:

http://markmail.org/message/hk5kpmgcelvizkif

Marvin Humphrey

Wolfram Humann

Jul 30, 2010, 11:30:07 AM
to David Golden, perl5-...@perl.org
On 30.07.2010 16:30, David Golden wrote:
> On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann<w.c.h...@arcor.de> wrote:
>
>> I would like to propose that perl requests an additional amount of memory
>> from realloc when strings grow by a small amount.
>>
> +1 -- though I'm not clear whether this win32 specific or whether all
> operating systems should work this way.
>
>
>>     if (newlen > SvLEN(sv)) {        /* need more room? */
>>      size_t minlen  = SvCUR(sv);
>>      minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
>>      if (newlen < minlen) newlen = minlen;
>>
> [snip]
>
> I have a few questions about the specific implementation. Please
> don't take them as suggestions that your approach is necessarily wrong
> -- but I'd like to understand the logic behind certain choices.
>
> * Why are you suggesting making the extra allocation proportional to
> the length of the string? Why not a multiple of the length of the
> increment? Why not a fixed amount?
>
Take a) the postscript file of 12 MB. Say it has 120k lines of 100 chars
each.
Take b) a string of 100 chars to which you append another 100 chars.
Multiply line-length by what?
Say 1000 -> max. memory waste 100kB. a) 120 reallocs, memory waste < 1%
-- o.k. b) 1 realloc, memory waste 99.9% -- not acceptable.
Say 10: case b) still wastes 80% of its memory but a) has 12000
reallocs -- not acceptable.
Fixed amounts are even worse. Try any value and see!

Geometric progression covers small and big things equally well. Also,
the time required for a realloc increases linearly with the amount of
memory that needs to be moved to a new location. So it does make sense
to realloc less often for big strings.

> * Why the "+10"? Why have a extra minimum at all? Why "10" instead
> of another number? Given your examples, for a large string the 10 is
> a trivial fraction of the total and for a small string, the efficiency
> either doesn't matter or the string quickly grows into a large string
> anyway.
>

It's debatable whether it's required, but if you start with an empty string
and add to it char-by-char, it does reduce the number of system calls
in the beginning substantially. I don't know how much it matters if you
have many very small strings (how much is the perl overhead for strings
already?)


> * Why do you think PERL_STRLEN_EXPAND_SHIFT should be 2? That's a 25%
> increase in the length of the string. (In many programs, I suspect
> many strings are appended to only a few times and that seems like
> large memory price to pay.)
>

I compared to Cygwin Perl on the same machine. It's compiled to use the
perl-internal malloc/realloc. It grows memory even more aggressively than
my proposal with a shift of 2. The total memory required for the same
program in Cygwin Perl is often 30% to 50% bigger than in my modified
perl. Also, if you go from 2 to 4 you reduce *max.* overhead from 25% to
12.5%, but you also quadruple the number of reallocs required. This can
add a lot to the runtime of your program.
Also note that the overhead only occurs if you add small amounts each
time. No overhead occurs if you increase the size of your original
string by 25% or more in a single append.


>> On current Strawberry Perl (5.10 or 5.12):
>> a) 2.6 seconds
>> b) 18 seconds
>>
>> Perl 5.12 compiled on win32 with modifications as above:
>> a) 0.014 seconds
>> b) 0.2 seconds
>>
> Awesome! What does it look like with a shift of 3 or 4?
>
>

As I mentioned, I put a good amount of benchmark data in my post on
c.l.p.m. This also contains memory usage info. If you can, have a look
there. If that doesn't answer your question or you want me to crosspost
it here, let me know.

Wolfram

Ben Morrow

Jul 30, 2010, 11:49:50 AM
to xda...@gmail.com, w.c.h...@arcor.de, perl5-...@perl.org
Quoth xda...@gmail.com (David Golden):

> On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann <w.c.h...@arcor.de> wrote:
> > I would like to propose that perl requests an additional amount of memory
> > from realloc when strings grow by a small amount.
>
> +1 -- though I'm not clear whether this win32 specific or whether all
> operating systems should work this way.

I did some testing when we were discussing this on clpmisc, and found
that at least FreeBSD has the same poor performance as Win32 when perl
is built without -Dusemymalloc. (The issue is much more severe on Win32,
since it's impossible to build with both pseudofork and usemymalloc.)

Ben

Reini Urban

Jul 30, 2010, 12:04:24 PM
to Marvin Humphrey, David Golden, Wolfram Humann, perl5-...@perl.org
Marvin Humphrey schrieb:

>> * Why the "+10"? Why have a extra minimum at all? Why "10" instead
>> of another number? Given your examples, for a large string the 10 is
>> a trivial fraction of the total and for a small string, the efficiency
>> either doesn't matter or the string quickly grows into a large string
>> anyway.
>
> That's similar to my analysis. I think it's more useful to round up to a
> multiple of the pointer size for small values.

I would rather align it to the pointer size, and keep an eye on the
pagesize.

I did similar tricks in Tie::CArray to get decent realloc
(array grow) performance.
http://cpansearch.perl.org/src/RURBAN/Tie-CArray-0.15/CArray.xs

#define PAGEBITS 11          /* we can also use 10 or 12; most systems
                                malloc to 12, i.e. 4096 */
#define MY_PAGESIZE (1 << PAGEBITS)  /* 2048 bytes is the size of a
                                        fresh carray */
/* This is the to-tune part:
 * The overall size should fit into a page or other malloc chunks.
 * Leave room for "some" more items, but align it to the page size.
 * Should small arrays (<100) be aligned at 2048 or smaller bounds?
 * 10 => 2048-10, 2000 => 2048-2000, 200.000 => 2048
 * len is the actual length of the array, size the itemsize in bytes.
 */
int freesize (int len, int size)
{
    len *= size;
    return max(MY_PAGESIZE - len, len - ((len >> PAGEBITS) << PAGEBITS))
           / size;
}
freesize is the slackspace to allocate additionally, and can be inlined.
To grow by n:
ptr = saferealloc (ptr, len + freesize(len+n, itemsize));
--
Reini Urban
http://phpwiki.org/ http://murbreak.at/

Eric Brine

Jul 30, 2010, 12:14:00 PM
to David Golden, Wolfram Humann, perl5-...@perl.org
On Fri, Jul 30, 2010 at 10:30 AM, David Golden <xda...@gmail.com> wrote:

> * Why are you suggesting making the extra allocation proportional to
> the length of the string?
>

Note that this is how arrays currently work. Push expands arrays using
the formula

new_size = old_size * 2 + 4

w.c.h...@arcor.de

Jul 30, 2010, 5:26:15 PM
to Marvin Humphrey, perl5-...@perl.org
From: Marvin Humphrey <mar...@rectangular.com>
Date: 30.07.2010 17:34

>> * Why the "+10"? Why have a extra minimum at all? Why "10" instead
>> of another number? Given your examples, for a large string the 10 is
>> a trivial fraction of the total and for a small string, the efficiency
>> either doesn't matter or the string quickly grows into a large string
>> anyway.
>>
>
> That's similar to my analysis. I think it's more useful to round up to a
> multiple of the pointer size for small values.
>
If I understand correctly (which I'm not certain about), this is done
anyway right afterwards in
newlen = PERL_STRLEN_ROUNDUP(newlen);

Wolfram

Jim Cromie

Jul 31, 2010, 12:37:24 AM
to w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org

what happens if you presize the variable first?
then assign an empty string,
then start appending the additions.

this will avoid all the sv_grow calls,
all the reallocs, and just use the one-time preallocated memory.

this avoids all the mods to the heuristics you're proposing
(modest tho they might be), and has no unintended effects
on other benchmarkable use cases.

There's some perl way to do this without explicitly assigning a large string
(as I did below), but I cannot recall it, and can't think of a suitably narrow
search term/strategy for it (hints welcome)

length($s) = 1000000; # isn't correct

[jimc@harpo perl]$ perl -MDevel::Peek -de1

main::(-e:1): 1
DB<1> $s='a'x10000;

DB<2> Dump $s
SV = PV(0x8c03df0) at 0x8c8c578
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x8ce9f48
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0
CUR = 10000
LEN = 10004

DB<3> $s='a'

DB<4> Dump $s
SV = PV(0x8c03df0) at 0x8c8c578
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x8ce9f48 "a"\0
CUR = 1
LEN = 10004

Eric Brine

Jul 31, 2010, 1:14:02 AM
to Jim Cromie, w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org
On Sat, Jul 31, 2010 at 12:37 AM, Jim Cromie <jim.c...@gmail.com> wrote:

> this avoids all the mods to the heuristics youre proposing
> (modest tho they might be), and has no unintended effects
> on other benchmarkable use cases.
>

The problem is not specific to building a giant string. Strings are grown
all the time.

> There's some perl way to do this without explicitly assigning a large string

It involves open(\$var)

David Golden

Jul 31, 2010, 7:44:18 AM
to Eric Brine, Jim Cromie, w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org
On Sat, Jul 31, 2010 at 1:14 AM, Eric Brine <ike...@adaelis.com> wrote:
>> There's some perl way to do this without explicitly assigning a large string
>
> It involves open(\$var)

Sounds like a new candidate for Scalar::Util.

-- David

w.c.h...@arcor.de

Jul 31, 2010, 8:33:35 AM
to Jim Cromie, perl5-...@perl.org
From: Jim Cromie <jim.c...@gmail.com>
Date: 31.07.2010 06:37

> On Fri, Jul 30, 2010 at 3:26 PM, <w.c.h...@arcor.de> wrote:
>
>> From: Marvin Humphrey <mar...@rectangular.com>
>> Date: 30.07.2010 17:34
>>
>>>> * Why the "+10"? Why have a extra minimum at all? Why "10" instead
>>>> of another number? Given your examples, for a large string the 10 is
>>>> a trivial fraction of the total and for a small string, the efficiency
>>>> either doesn't matter or the string quickly grows into a large string
>>>> anyway.
>>>>
>>>>
>>> That's similar to my analysis. I think it's more useful to round up to a
>>> multiple of the pointer size for small values.
>>>
>>>
>> If I understand correctly (which I'm not certain about), this is done anyway
>> right afterwards in
>> newlen = PERL_STRLEN_ROUNDUP(newlen);
>>
>> Wolfram
>>
>>
>>
>
> what happens if you presize the variable 1st ?
> then assign an empty string,
> then start appending the additions.
>
> this will avoid all the sv_grow calls,
> all the reallocs, and just use the one-time preallocated memory.
>
> this avoids all the mods to the heuristics youre proposing
> (modest tho they might be), and has no unintended effects
> on other benchmarkable use cases.
>
There are many cases when you don't know the final size of your
string. In that case you would need to invent a strategy similar to what
I proposed. Take my original script:

$ps_text = qx( pdftops some_pdf_file.pdf - );

The PDF file is about 2MB. I know it will grow during PDF -> Postscript
conversion, but it's hard to estimate by how much. If I play it safe, I
will likely waste much more memory than a well-designed perl-internal
strategy would.
Also, any perl compiled to use the internal malloc/realloc will "round
up to a certain geometric progression anyway" (Ilya Zakharevich). If you
look at Marvin Humphrey's post, you'll see that such a geometric growth
strategy (although with different details concerning acceptable memory
overhead etc.) is very common.

Wolfram


Nicholas Clark

Jul 31, 2010, 11:04:40 AM
to w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org

That's certainly what the expression in PERL_STRLEN_ROUNDUP() is intended to
do. (Round up to a multiple of the pointer size, so that growth of strings by
small amounts usually avoids repeated realloc() calls for an extra 1 or 2
bytes).

I'm not sure what my opinion on the main proposal is.
(I don't have time to read and digest all my mail currently)

Nicholas Clark

demerphq

Aug 2, 2010, 6:26:19 AM
to w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org
On 31 July 2010 17:04, Nicholas Clark <ni...@ccl4.org> wrote:
> On Fri, Jul 30, 2010 at 11:26:15PM +0200, w.c.h...@arcor.de wrote:
>> From: Marvin Humphrey <mar...@rectangular.com>
>> Date: 30.07.2010 17:34
>> >>* Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
>> >>of another number?  Given your examples, for a large string the 10 is
>> >>a trivial fraction of the total and for a small string, the efficiency
>> >>either doesn't matter or the string quickly grows into a large string
>> >>anyway.
>> >>
>> >
>> >That's similar to my analysis.  I think it's more useful to round up to a
>> >multiple of the pointer size for small values.
>> >
>> If I understand correctly (which I'm not certain about), this is done
>> anyway right afterwards in
>> newlen = PERL_STRLEN_ROUNDUP(newlen);
>
> That's certainly what the expression in PERL_STRLEN_ROUNDUP() is intended to
> do. (Round up to a multiple of the pointer size, so that growth of strings by
> small amounts usually avoids repeated realloc() calls for an extra 1 or 2
> bytes).

When I saw this I was confused, as I remember when Nicholas wrote the
patch that added PERL_STRLEN_ROUNDUP(), and I'm wondering why this
patch doesn't modify /that/ instead of adding additional logic /in
front/, which will then be mixed together....

IOW, it seems like this patch will add some percentage, and then
PERL_STRLEN_ROUNDUP will add another percentage of /that/.

This doesn't seem right. What's the problem with PERL_STRLEN_ROUNDUP()?
And, given that PERL_STRLEN_ROUNDUP() is guarded like so:

#ifndef Perl_safesysmalloc_size
newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif

why should this new logic not also be guarded? Surely the intent is to
have a mode where Perl doesn't mess with the requested size at all?

cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

demerphq

Aug 2, 2010, 6:36:04 AM
to David Golden, Eric Brine, Jim Cromie, w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org

Here are three ways to do it in perl. The first is interesting as it
works slightly differently in that it enables the OOK flag as well,
which IMO is a bug in substr().

Yves

$ perl -MDevel::Peek -e'$size=10; $str=" " x $size;
substr($str,0,$size,""); Dump($str); $str.="ab"; Dump($str);'
SV = PVIV(0x8ac9fd4) at 0x8ad40c8
REFCNT = 1
FLAGS = (POK,OOK,pPOK)
IV = 10 (OFFSET)
PV = 0x8acafa2 ( " " . ) ""\0
CUR = 0
LEN = 2
SV = PVIV(0x8ac9fd4) at 0x8ad40c8
REFCNT = 1
FLAGS = (POK,pPOK)
IV = 0
PV = 0x8acaf98 "ab"\0
CUR = 2
LEN = 12
$ perl -MDevel::Peek -e'$size=10; $str=" " x $size; $str=~s/.+//g;
Dump($str); $str.="ab"; Dump($str);'
SV = PV(0x92b1748) at 0x92d10c0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x92c7f98 ""\0
CUR = 0
LEN = 12
SV = PV(0x92b1748) at 0x92d10c0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x92c7f98 "ab"\0
CUR = 2
LEN = 12
$ perl -MDevel::Peek -e'$size=10; $str=" " x $size; chop($str) while
length($str); Dump($str); $str.="ab"; Dump($str);'
SV = PV(0x928c748) at 0x92ac0d0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x92a2f98 ""\0
CUR = 0
LEN = 12
SV = PV(0x928c748) at 0x92ac0d0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x92a2f98 "ab"\0
CUR = 2
LEN = 12

H.Merijn Brand

Aug 2, 2010, 7:00:23 AM
to demerphq, David Golden, Eric Brine, Jim Cromie, w.c.h...@arcor.de, Marvin Humphrey, perl5-...@perl.org
On Mon, 2 Aug 2010 12:36:04 +0200, demerphq <deme...@gmail.com> wrote:

> On 31 July 2010 13:44, David Golden <xda...@gmail.com> wrote:
> > On Sat, Jul 31, 2010 at 1:14 AM, Eric Brine <ike...@adaelis.com> wrote:
> >> Theres some perl way to do this without explicitly assigning a large string
> >>>
> >>
> >> It involves open(\$var)
> >
> > Sounds like a new candidate for Scalar::Util.
>
> Here are three ways to do it in perl. The first is interesting as it
> works slightly differently in that it enables the OOK flag as well,
> which IMO is a bug in substr().

Fourth way is implemented in Data::Peek

$ perl -MDP -e'DDump $_;DGrow $_, 102400; DDump $_'
SV = NULL(0x0) at 0x8301568
REFCNT = 1
FLAGS = ()
SV = PV(0x82d1114) at 0x8301568
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x83d7558 ""\0
CUR = 0
LEN = 102400

my $LEN = DGrow ($pv, $size)
Fastest way to preallocate space for a PV scalar. Returns the allocated
length. If $size is smaller than the already allocated space, it will
not shrink.

cmpthese (-2, {
pack => q{my $x = ""; $x = pack "x20000"; $x = "";},
op_x => q{my $x = ""; $x = "x" x 20000; $x = "";},
grow => q{my $x = ""; DGrow ($x, 20000); $x = "";},
});

Rate op_x pack grow
op_x 62127/s -- -59% -96%
pack 152046/s 145% -- -91%
grow 1622943/s 2512% 967% --

XS code:

void
DGrow (sv, size)
	SV *sv
	IV size
  PROTOTYPE: $$
  PPCODE:
	if (SvROK (sv))
	    sv = SvRV (sv);
	if (!SvPOK (sv))
	    sv_setpvn (sv, "", 0);
	SvGROW (sv, size);
	mPUSHi (SvLEN (sv));
	/* XS DGrow */

--
H.Merijn Brand http://tux.nl Perl Monger http://amsterdam.pm.org/
using 5.00307 through 5.12 and porting perl5.13.x on HP-UX 10.20, 11.00,
11.11, 11.23, and 11.31, OpenSuSE 10.3, 11.0, and 11.1, AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/ http://www.test-smoke.org/
http://qa.perl.org http://www.goldmark.org/jeff/stupid-disclaimers/

Jan Dubois

Aug 13, 2010, 8:39:53 PM
to Wolfram Humann, perl5-...@perl.org
On Fri, 30 Jul 2010, Wolfram Humann wrote:

[...]

> I would like to propose that perl requests an additional amount of
> memory from realloc when strings grow by a small amount. After
> discussion on c.l.p.m. and with the help of ideas from Ilya Zakharevich
> and Peter J. Holzer my proposal reads like this:
> (I'm not a regular user of diff and using it wrong might do more harm
> than good)
>
> In sv.c the following section in Perl_sv_grow():
>
> if (newlen > SvLEN(sv)) { /* need more room? */
> #ifndef Perl_safesysmalloc_size
> newlen = PERL_STRLEN_ROUNDUP(newlen);
> #endif
>
> should be expanded to:
>
> #ifndef PERL_STRLEN_EXPAND_SHIFT
> # define PERL_STRLEN_EXPAND_SHIFT 2
> #endif
>
> if (newlen > SvLEN(sv)) { /* need more room? */
> size_t minlen = SvCUR(sv);
> minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
> if (newlen < minlen) newlen = minlen;
> /* newlen += (newlen >> 1) + 10; */
> #ifndef Perl_safesysmalloc_size
> newlen = PERL_STRLEN_ROUNDUP(newlen);
> #endif

Thank you for your patch! I've now committed a slightly cleaned up version
to the Perl repository, and added your email address to the AUTHORS file:

http://perl5.git.perl.org/perl.git/commitdiff/f12005599f648e675af22dfef1047191e260bc48
http://perl5.git.perl.org/perl.git/commitdiff/1fd8fa480e535be8885c2c1b2e7e95c3aabbdf7f

Beyond formatting changes I did:

* changed size_t to STRLEN for internal consistency
* removed the commented out code
* moved the PERL_STRLEN_EXPAND_SHIFT definition to perl.h

The discussion of this change seemed to have stalled, but I see
+1 votes from David Golden and Marvin Humphrey, with additional
information from Ben Morrow that the patch also helps on FreeBSD
(when building without -Dusemymalloc), with nobody voicing
any disagreement about applying the patch.

I've kept the original constants (PERL_STRLEN_EXPAND_SHIFT being
2, plus the minimum addition of 10). While the shift overallocates
by 25%, this is based on the original length of the buffer, not
the requested new length. The reported 12.5% overallocation used
by Python (for arrays, not strings) is based on the new size already,
not the old.

The advantage of this algorithm is also that we don't overallocate
at all if the buffer already grows by more than 25%, so we won't
"waste" any space when 2 long strings are concatenated, only when
small strings are repeatedly appended to an already long one.

I don't care much either way about the constant 10, but the price
for it is small: on a 32-bit system a null-string SV already
occupies 28 bytes (24 + 1*4), so once it grows to 32 bytes,
the "10" will actually expand it to 36 (24 + 3*4) bytes instead,
which doesn't matter either way.

Things that could benefit from further discussion are:

a) the algorithm in sv_grow() for expanding SVs with SvOOK

if (newlen > SvLEN(sv))
newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */

This doesn't make much sense, and is catastrophic when
you append a very long string to a short one. I have not looked
into the heritage of this algorithm yet.

b) Should the new over-allocation also be used under -Dusemymalloc?
It will provide less benefit there, but also come at a lower cost,
because the newlen can still fall inside the already allocated bucket
size anyway. I don't have a strong opinion on whether it should be
disabled in this situation or not.

Cheers,
-Jan


Reini Urban

Aug 15, 2010, 11:51:31 AM
to Jan Dubois, Wolfram Humann, perl5-...@perl.org
Jan Dubois schrieb:

This particular slowdown was only recognized for the WIN32 native
malloc, but not for other platforms with better malloc libs.
Those platforms are now hurt by the overallocation,
i.e. they need more memory, e.g. with piping.

Why was this patch not applied with the appropriate
#if defined(_WIN32) or what is used for MSVC and mingw?

> a) the algorithm in sv_grow() for expanding SVs with SvOOK
>
> if (newlen > SvLEN(sv))
>     newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
> This doesn't make much sense, and is catastrophic when
> you append a very long string to a short one. I have not looked
> into the heritage of this algorithm yet.

--
Reini Urban

Jan Dubois

Aug 16, 2010, 2:19:25 AM8/16/10
to Reini Urban, Wolfram Humann, perl5-...@perl.org
On Sun, 15 Aug 2010, Reini Urban wrote:
> Jan Dubois schrieb:
> > On Fri, 30 Jul 2010, Wolfram Humann wrote:
> > The discussion of this change seemed to have stalled, but I see
> > +1 votes from David Golden and Marvin Humphrey, with additional
> > information from Ben Morrow that the patch also helps on FreeBSD
> > (when building without -Dusemymalloc), with nobody voicing
> > any disagreement about applying the patch.
>
> This particular slowdown was only recognized for WIN32 native malloc,
> but not for other platforms with better malloc libs.

Did you read the paragraph you quoted above? It explicitly claims that
the slowdown happens on other platforms when using the platform native
malloc.

> Those platforms are now hurt by using overallocation,
> i.e. need more memory, e.g. with piping.

Could you provide some evidence for this claim? The only way a
"better malloc" can prevent this slowdown is by doing some kind
of overallocation itself. Since the algorithm in this patch
is not necessarily cumulative with the overallocation by malloc(),
it is quite possible that the change has no effect at all
on systems with an overallocating malloc().

For example, assume Perl is appending 100 bytes to a 1000 bytes string.
The patch under discussion will make sure that sv_grow() will request
1260 bytes (1000 + 1000>>2 + 10) instead of just 1100 bytes from realloc().

Assume the original 1000 bytes were allocated in a 1024 bytes slab in
the allocator, so the 1100 bytes wouldn't fit in anymore, and realloc()
will now move this to e.g. a 1536 byte slab. In that case it doesn't
make any difference that we now asked for 1260 bytes instead of 1100.

Also note that the minimum growth is based on the old buffer size,
so appending 300 bytes to the 1000 byte string will only request
1300 bytes, because 1300 is already larger than the 1260 minimum
realloc growth.

So until proven otherwise I doubt that this patch has any noticeable
effect on a "good malloc()". On a "medium malloc()" I would expect
it to improve performance somewhat, at a moderate additional memory
requirement.

> Why was this patch not applied with the appropriate
> #if defined(_WIN32) or what is used for MSVC and mingw?

Because then it wouldn't be applied to other platforms, like FreeBSD.

Note though that I added this remark for discussion about disabling it
under -Dusemyalloc:

| b) Should the new over-allocation also be used under -Dusemyalloc. It
| will provide less benefit there, but also come at a lower cost because
| the newlen can still fall inside the already allocated bucket size
| anyways. I don't have any strong opinion if it should be disabled
| in this situation or not.

But as I stated above, I don't think it will make much of a difference either
way for the -Dusemyalloc case, but would love to see some comprehensive
benchmarks.

Cheers,
-Jan


Reini Urban

Aug 16, 2010, 7:18:48 AM8/16/10
to Jan Dubois, Wolfram Humann, perl5-...@perl.org
2010/8/16 Jan Dubois <ja...@activestate.com>:

> On Sun, 15 Aug 2010, Reini Urban wrote:
>> Jan Dubois schrieb:
>> > On Fri, 30 Jul 2010, Wolfram Humann wrote:
>> > The discussion of this change seemed to have stalled, but I see
>> > +1 votes from David Golden and Marvin Humphrey, with additional
>> > information from Ben Morrow that the patch also helps on FreeBSD
>> > (when building without -Dusemymalloc), with nobody voicing
>> > any disagreement about applying the patch.
>>
>> This particular slowdown was only recognized for WIN32 native malloc,
>> but not for other platforms with better malloc libs.
>
> Did you read the paragraph you quoted above?  It explicitly claims that
> the slowdown happens on other platforms when using the platform native
> malloc.
>
>> Those platforms are now hurt by using overallocation,
>> i.e. need more memory, e.g. with piping.
>
> Could you provide some evidence for this claim?  The only way a
> "better malloc" can prevent this slowdown is by doing some kind
> of overallocation itself.  Since the algorithm in this patch
> is not necessarily cumulative with the overallocation by malloc()
> it is very well possible that the change has no effect at all
> on systems with a overallocating malloc().

This is just theory. I KNOW that plain gnumalloc and freebsd realloc
do work fine.
But now that we have the mess someone can test it. I don't have such systems
so I cannot test it.

> For example, assume Perl is appending 100 bytes to a 1000 bytes string.
> The patch under discussion will make sure that sv_grow() will request
> 1260 bytes (1000 + 1000>>2 + 10) instead of just 1100 bytes from realloc().
>
> Assume the original 1000 bytes were allocated in a 1024 bytes slab in
> the allocator, so the 1100 bytes wouldn't fit in anymore, and realloc()
> will now move this to e.g. a 1536 byte slab.  In that case it doesn't
> make any difference that we now asked for 1260 bytes instead of 1100.
>
> Also note that the minimum growth is based on the old buffer size,
> so appending 300 bytes to the 1000 byte string will only request
> 1300 bytes, because 1300 is already larger than the 1260 minimum
> realloc growth.
>
> So until proven otherwise I doubt that this patch has any noticeable
> effect on a "good malloc()".  On a "medium malloc()" I would expect
> it to improve performance somewhat, at a moderate additional memory
> requirement.
>
>> Why was this patch not applied with the appropriate
>> #if defined(_WIN32) or what is used for MSVC and mingw?
>
> Because then it wouldn't be applied to other platforms, like FreeBSD.

Uuh, nobody ever complained about freebsd realloc performance.
It was always the fastest on the planet and still is.

> Note thought that I added this remark for discussion about disabling it
> under -Dusemyalloc:
>
> | b) Should the new over-allocation also be used under -Dusemyalloc.  It
> |    will provide less benefit there, but also come at a lower cost because
> |    the newlen can still fall inside the already allocated bucket size
> |    anyways.  I don't have any strong opinion if it should be disabled
> |    in this situation or not.
>
> But as I stated above, I don't think it will make much of a difference either
> way for the -Dusemyalloc case, but would love to see some comprehensive
> benchmarks.

At http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2?pli=1
were a lot of benchmarks and profiling data.
He is testing string sizes of 1e5-1e7 bytes, not just page sizes,
piping a typical PDF to perl.

Plain freebsd had for years the best realloc performance (first with
phkmalloc, now they switched to jemalloc for better multi-core
performance), which was always faster than simple gnumalloc (ptmalloc 2
or 3, based on Doug Lea's public-domain malloc), but "gnumalloc" was
fast enough. 12 ms against 16 sec.

Without this patch.

So I don't see any reason to "fix" gnumalloc or freebsd realloc
default behaviour, unless someone reports problems there. Tuning
realloc is a black art (I did it for Tie::CArray once) and I don't
want someone to touch this without any tests.

Only msvcrt is affected, so only platforms which use the msvcrt
realloc should be patched, especially given the lack of tests on the
other platforms.

FYI: cygwin (newlib) normally uses freebsd-derived libc
implementations, but in this case cygwin uses not phkmalloc but
ptmalloc 2.6.4, i.e. plain gnumalloc (which has no public
independent_comalloc(), which is a shame btw).

-Dusemymalloc is overallocating a lot, but we know that. This is not a
realloc problem per se; plain malloc does the same.
--
Reini Urban

David Golden

Aug 16, 2010, 9:37:06 AM8/16/10
to Reini Urban, Wolfram Humann, Jan Dubois, perl5-...@perl.org
I think I'm with Reini on this one. I'd rather keep it Windows-only until a
performance problem is documented elsewhere.

It's easy to enable for other platforms when needed.

I'm open to people doing the research before 5.14, of course. :-)

David

Ben Morrow

Aug 16, 2010, 1:19:30 PM8/16/10
to rur...@x-ray.at, perl5-...@perl.org
Quoth rur...@x-ray.at (Reini Urban):
> 2010/8/16 Jan Dubois <ja...@activestate.com>:

> >
> > Could you provide some evidence for this claim?  The only way a
> > "better malloc" can prevent this slowdown is by doing some kind
> > of overallocation itself.  Since the algorithm in this patch
> > is not necessarily cumulative with the overallocation by malloc()
> > it is very well possible that the change has no effect at all
> > on systems with a overallocating malloc().
>
> This is just theory. I KNOW that plain gnumalloc and freebsd realloc
> do work fine.
> But now that we have the mess someone can test it. I don't have such systems
> so I cannot test it.

The test I referred to running under FreeBSD is at
http://groups.google.com/group/comp.lang.perl.misc/msg/9e373f4d1e755456
, and the benchmark in question is at
http://groups.google.com/group/comp.lang.perl.misc/msg/1e3c56e7c7360de3
. The difference between -Dusemymalloc and not is *huge*, for instance

5.8.8, system malloc
1E7 chars + 1E4 x 1E2 chars: 7159.0 ms
5.8.8, mymalloc
1E7 chars + 1E4 x 1E2 chars: 45.3 ms

Now, it's possible that benchmark is faulty, or that perl's allocation
strategy has changed significantly since 5.8.8, but you would need to
give some reason for believing that before blindly asserting that
FreeBSD's malloc is the fastest thing in the world and nothing could
possibly improve its performance.

> Uuh, nobody ever complained about freebsd realloc performance.
> It was always the fastest on the planet and still is.

Not for perl's usage patterns, apparently. This is borne out by the fact
that the ports perl has been built with -Dusemymalloc forever.

> > Note thought that I added this remark for discussion about disabling it
> > under -Dusemyalloc:
> >
> > | b) Should the new over-allocation also be used under -Dusemyalloc.  It
> > |    will provide less benefit there, but also come at a lower cost because
> > |    the newlen can still fall inside the already allocated bucket size
> > |    anyways.  I don't have any strong opinion if it should be disabled
> > |    in this situation or not.
> >
> > But as I stated above, I don't think it will make much of a difference either
> > way for the -Dusemyalloc case, but would love to see some comprehensive
> > benchmarks.
>
> At
> http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2?pli=1
> were a lot of benchmarks and profiling data.
> He is testing string sizes of 1e5-1e7 byte not just pagesizes, piping
> a typical pdf to perl.

Are you saying this is a bad benchmark, for some reason? If you can
supply a better I'd be happy to run it for you.

Ben

Jan Dubois

Aug 16, 2010, 3:17:41 PM8/16/10
to Reini Urban, Wolfram Humann, perl5-...@perl.org
On Mon, 16 Aug 2010, Reini Urban wrote:
> 2010/8/16 Jan Dubois <ja...@activestate.com>:

> > Could you provide some evidence for this claim?  The only way a
> > "better malloc" can prevent this slowdown is by doing some kind
> > of overallocation itself.  Since the algorithm in this patch
> > is not necessarily cumulative with the overallocation by malloc()
> > it is very well possible that the change has no effect at all
> > on systems with a overallocating malloc().
>
> This is just theory. I KNOW that plain gnumalloc and freebsd realloc
> do work fine.
> But now that we have the mess someone can test it. I don't have such systems
> so I cannot test it.

I'm getting a bit tired of you discarding Ben's actual test based on
some "knowledge" you claim to have but can't even verify, because
you don't have access to those systems. This is a technical issue, not
a religious belief!

[...]

> > Because then it wouldn't be applied to other platforms, like FreeBSD.
>
> Uuh, nobody ever complained about freebsd realloc performance.
> It was always the fastest on the planet and still is.

I pointed out *twice* to you that Ben did actually measure it and came up
with pretty bad numbers for the freebsd reallocator. You will need to come
up with some evidence why his benchmarks should be discarded.

To provide similar numbers for Linux and GNU malloc I've compiled bleadperl
just before and after the patch in question, both with usemymalloc=y
and usemymalloc=n. The results are just for the "1E7 chars + 1E5 x 1E1 chars"
benchmark, as that is the slowest of the bunch. I've run the benchmark
script 100 times for each Perl build and show the min/max runtimes to show
that there is quite a bit of noise:

Before, GNU malloc: Min=38.6 Max=45.4 Avg=41.20
Before, Perl malloc: Min= 7.9 Max=14.2 Avg=11.14

After, GNU malloc: Min= 9.7 Max=12.7 Avg=11.45
After, Perl malloc: Min= 9.4 Max=13.2 Avg=11.16

It shows that GNU malloc on its own takes 4 times as long as GNU malloc with
the patch. GNU malloc with this patch matches the time used by the Perl
malloc (usemymalloc=y), which doesn't seem to be affected by the patch.

> > Note thought that I added this remark for discussion about disabling it
> > under -Dusemyalloc:
> >
> > | b) Should the new over-allocation also be used under -Dusemyalloc.  It
> > |    will provide less benefit there, but also come at a lower cost because
> > |    the newlen can still fall inside the already allocated bucket size
> > |    anyways.  I don't have any strong opinion if it should be disabled
> > |    in this situation or not.
> >
> > But as I stated above, I don't think it will make much of a difference either
> > way for the -Dusemyalloc case, but would love to see some comprehensive
> > benchmarks.
>
> At http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2?pli=1
> were a lot of benchmarks and profiling data.
> He is testing string sizes of 1e5-1e7 byte not just pagesizes, piping
> a typical pdf to perl.

That is exactly the script I have been running in my tests above.



> Plain freebsd had for years the best realloc performance (first with
> phkmalloc, now
> they switched to jemalloc for better multi-core performance), which were always
> faster than simple gnumalloc (ptmalloc 2 or 3 based on Doug Lea's
> public domain malloc),
> but "gnumalloc" was fast enough. 12ms against 16sec.
>
> Without this patch.

See above for a comparison between GNU malloc and Perl malloc. Note that
the Google thread you reference *also* shows that Perl malloc performs
3 times better than GNU malloc at this particular benchmark.



> So I don't see any reason to "fix" gnumalloc or freebsd realloc
> default behaviour, unless
> someone reports problems there. Tuning realloc is black art ( I did it
> for Tie::CArray once )
> and don't want someone to touch this without any tests.

Nobody is tuning realloc(), this is all about improving the performance of
sv_grow().



> Only msvcrt is affected and so only platforms which use msvcrt realloc should
> be patched, esp. without any tests on the other platforms.

This is really getting old now...



> FYI: cygwin (newlib) uses normally freebsd derived libc
> implementations, but in this case
> cygwin uses not phkmalloc, just ptmalloc2.6.4 i.e. plain gnumalloc.
> - which has no public independent_comalloc() which is a shame btw.
>
> -Dusemymalloc is overallocating a lot, but we know that. This is not a
> realloc problem per se, plain malloc does the same.

You seem to have Cygwin available to you. Why don't you just test the
patch and report on any actual problems instead of spreading bad attitude?

Cheers,
-Jan

Wolfram Humann

Aug 16, 2010, 6:39:44 PM8/16/10
to Jan Dubois, perl5-...@perl.org
On 16.08.2010 21:17, Jan Dubois wrote:
> To provide similar numbers for Linux and GNU malloc I've compiled
> bleadperl
> just before and after the patch in question, both with usemymalloc=y
> and usemymalloc=n. The results are just for the "1E7 chars + 1E5 x
> 1E1 chars"
> benchmark, as that is the slowest of the bunch. I've run the benchmark
> script 100 times for each Perl build and show the min/max runtimes to
> show
> that there is quite a bit of noise:
>
> Before, GNU malloc: Min=38.6 Max=45.4 Avg=41.20
> Before, Perl malloc: Min= 7.9 Max=14.2 Avg=11.14
>
> After, GNU malloc: Min= 9.7 Max=12.7 Avg=11.45
> After, Perl malloc: Min= 9.4 Max=13.2 Avg=11.16
>
> It shows that GNU malloc on its own takes 4 times as long as GNU
> malloc with
> the patch. GNU malloc with this patch matches the time used by the Perl
> malloc (usemymalloc=y), which doesn't seem to be affected by the patch.

Hm, the differences I get on my Linux go in the same direction but are
less impressive. On an unpatched perl I get around 12 ms with
usemymalloc=n and around 8 ms with usemymalloc=y.

But I also tried to follow up on Peter J. Holzer's remarks in
<http://groups.google.com/group/comp.lang.perl.misc/msg/8747b64794aec27b>.
He used strace for a closer look at what GNU malloc does and found that
it leaves free space after allocated memory so that it can often realloc
without moving memory. This free space (as far as I understand from
watching private memory usage reported by Linux::Smaps) is not
considered part of perl's memory until actually used. He also suggests
that this algorithm -- like most "clever" ones -- can be broken. Looks
like it can:
(All on an unpatched perl 5.10.1 with usemymalloc=n and usemymalloc=y.
Starting with an empty my @ar2;. Code given is an additional line in my
array of tests, but could also run stand-alone)

1) 'append $i' => sub{ for my $i (1..1E3){ $ar2[$i] .= $c1E2 for 1..1E3 } }
That's the 'good' case: Append 1000 times to one string, move to the
next string, append 1000 times to that one, move to the next...
GNU malloc: 238.8 ms
Perl malloc: 164.8 ms

2) 'append $_' => sub{ for my $i (1..1E3){ $ar2[$_] .= $c1E2 for 1..1E3 } }
That's the 'bad' case: append to each string once, then again to each
string, ...
GNU malloc: 57143.0 ms
Perl malloc: 298.0 ms

Jan, if possible, could you run that on your before-and-after-the-patch
compiles?

Wolfram

Jan Dubois

Aug 16, 2010, 6:47:39 PM8/16/10
to Wolfram Humann, perl5-...@perl.org
On Mon, 16 Aug 2010, Wolfram Humann wrote:
> (All on an unpatched perl 5.10.1 with usemymalloc=n and usemymalloc=y.
> Starting with an empty my @ar2;. Code given is an additional line in my
> array of tests, but could also run stand-alone)
>
> 1) 'append $i' => sub{ for my $i (1..1E3){ $ar2[$i] .= $c1E2 for 1..1E3 } }
> That's the 'good' case: Append 1000 times to one string, move to the
> next string, append 1000 times to that one, move to the next...
> GNU malloc: 238.8 ms
> Perl malloc: 164.8 ms
>
> 2) 'append $_' => sub{ for my $i (1..1E3){ $ar2[$_] .= $c1E2 for 1..1E3 } }
> That's the 'bad' case: append to each string once, than again to each
> string,...
> GNU malloc: 57143.0 ms
> Perl malloc: 298.0 ms
>
> Jan, if possible, could you run that on your before-and-after-the-patch
> compiles?

Before, GNU alloc:
append $i: 220.4 ms
append $_: 12879.6 ms

Before, Perl alloc:
append $i: 200.4 ms
append $_: 297.6 ms

After, GNU alloc:
append $i: 190.9 ms
append $_: 282.8 ms

After, Perl alloc:
append $i: 198.2 ms
append $_: 269.7 ms

So I guess as expected: GNU malloc improves a lot, Perl malloc is unaffected.

I haven't looked at total memory consumption for these tests, so I can't
confirm that the memory usage for Perl malloc() is unchanged. I strongly
suspect that to be the case though.

Cheers,
-Jan


Jan Dubois

Aug 16, 2010, 7:09:13 PM8/16/10
to David Golden, Reini Urban, Wolfram Humann, perl5-...@perl.org
On Mon, 16 Aug 2010, David Golden wrote:
>
> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
> performance problem is documented elsewhere.

It's been documented that the patch improves performance on FreeBSD by a
*factor* of 160 for one particular test case, as well as a factor of 4
on Linux. In the meantime Wolfram has constructed another testcase where
the patch improves performance on Linux by a factor of 45 (my system) to
230 (Wolfram's system).

Granted, these are worst-case samples, but at least the first test
program was derived from a real-world application: appending lots of
small strings to an ever-growing large buffer.

The performance improvements have only been observed with
usemymalloc=n, bringing that case up to par with the usemymalloc=y case
performance-wise.

The usemymalloc=y case seems (as expected) unaffected by this change. So
disabling the code for that case won't make a difference. I generally
prefer to keep code as simple as possible though, so I'm not sure if the
patch needs to be disabled for usemymalloc=y. This is more a
philosophical than a technical issue.

> It's easy to enable for other platforms when needed.

So far the patch has improved performance for *every* platform malloc() it
has been tried on. The only malloc() that doesn't benefit from it is the
Perl malloc from Ilya.



> I'm open to people doing the research before 5.14, of course. :-)

Given the way the algorithm works, and that it has shown to improve
performance on Windows, Linux and FreeBSD I claim it is now time for
the detractors to find a system where the change has any negative
consequences.

Cheers,
-Jan

Ben Morrow

Aug 16, 2010, 7:25:05 PM8/16/10
to w.c.h...@arcor.de, perl5-...@perl.org
Quoth w.c.h...@arcor.de (Wolfram Humann):

>
> But I also tried to follow up on Peter J. Holzer's remarks in
> <http://groups.google.com/group/comp.lang.perl.misc/msg/8747b64794aec27b>.
> He used strace for a closer look at what GNU malloc does and found that
> it leaves free space after allocated memory so that it can often realloc
> without moving memory.

Does glibc have a malloc_size() function (or equivalent), and does perl
use it? You can check with perl -V:'d_malloc.*_size'. If there is such a
function but perl isn't finding it it may be worth fixing that.

FreeBSD appears to have an equivalent function called malloc_usable_size,
defined in <malloc_np.h>, which perl doesn't know about. When I get a
chance I'll try hacking it in to see if it makes a difference (I don't
think I've a hope of doing a proper Configure patch for it :().

> This free space (as far as I understand form
> watching private memory usage reported by Linux::Smaps) is not
> considered part of perl's memory ntil actually used.

This is usual. Mapped memory that hasn't been touched yet isn't actually
allocated anywhere by the kernel. (Linux is more aggressive about
allowing over-allocation from userland than some other OSen.)

Ben

David Golden

Aug 16, 2010, 8:20:25 PM8/16/10
to Jan Dubois, Reini Urban, Wolfram Humann, perl5-...@perl.org
On Mon, Aug 16, 2010 at 7:09 PM, Jan Dubois <ja...@activestate.com> wrote:
> On Mon, 16 Aug 2010, David Golden wrote:
>>
>> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
>> performance problem is documented elsewhere.
>
> It's been documented that the patch improves performance on FreeBSD by a
> *factor* of 160 for one particular test case, as well as a factor of 4
> on Linux. In the meantime Wolfram has constructed another testcase where
> the patch improves performance on Linux by a factor of 45 (my system) to
> 230 (Wolfram's system).

+1

-- David

Zefram

Aug 17, 2010, 1:45:09 AM8/17/10
to perl5-...@perl.org
Ben Morrow wrote:
>Does glibc have a malloc_size() function (or equivalent),

No.

-zefram

Elizabeth Mattijsen

Aug 17, 2010, 4:15:14 AM8/17/10
to Wolfram Humann, Jan Dubois, David Golden, Reini Urban, perl5-...@perl.org
On Aug 17, 2010, at 10:09 AM, Wolfram Humann wrote:

> -------- Original Message --------
> Subject: Re: Patch to make string-append on win32 100 times faster
> From: Jan Dubois <ja...@activestate.com>
> To: 'David Golden' <xda...@gmail.com>, 'Reini Urban' <rur...@x-ray.at>
> Date: 17.08.2010 01:09
>> On Mon, 16 Aug 2010, David Golden wrote:
>>
>>> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
>>> performance problem is documented elsewhere.
>>>
>>
>> It's been documented that the patch improves performance on FreeBSD by a
>> *factor* of 160 for one particular test case, as well as a factor of 4
>> on Linux. In the meantime Wolfram has constructed another testcase where
>> the patch improves performance on Linux by a factor of 45 (my system) to
>> 230 (Wolfram's system).
>>
>> Granted, these are worst-case samples, but at least the first test
>> program was derived from a real-world application: appending lots of
>> small strings to an ever-growing large buffer.
> I think that even my worst-case Linux test case is not totally out-of-the-world. It occurs whenever a large number of strings grows simultaneously by small increments. One application that comes to mind is pivoting (exchanging rows and columns of) a large text-based table: loop over the lines, splitting each row and appending the chunks to the respective column's string; finally print the column strings as new rows when the complete table is read.

FWIW, I have several pieces of important business logic that add packed integers to hash values for millions of iterations. Seems like a worst case to me.

Having said that, I've been moving these cases to use a temporary file, basically using the file system as a hash. This, in the end, proved to work a lot faster and scaled much better, way beyond the physical RAM that was available.

Liz

Wolfram Humann

Aug 17, 2010, 4:09:37 AM8/17/10
to Jan Dubois, David Golden, Reini Urban, perl5-...@perl.org
-------- Original Message --------
Subject: Re: Patch to make string-append on win32 100 times faster
From: Jan Dubois <ja...@activestate.com>
To: 'David Golden' <xda...@gmail.com>, 'Reini Urban' <rur...@x-ray.at>
Date: 17.08.2010 01:09
> On Mon, 16 Aug 2010, David Golden wrote:
>
>> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
>> performance problem is documented elsewhere.
>>
>
> It's been documented that the patch improves performance on FreeBSD by a
> *factor* of 160 for one particular test case, as well as a factor of 4
> on Linux. In the meantime Wolfram has constructed another testcase where
> the patch improves performance on Linux by a factor of 45 (my system) to
> 230 (Wolfram's system).
>
> Granted, these are worst-case samples, but at least the first test
> program was derived from a real-world application: appending lots of
> small strings to an ever-growing large buffer.
I think that even my worst-case Linux test case is not totally
out-of-the-world. It occurs whenever a large number of strings grows
simultaneously by small increments. One application that comes to mind
is pivoting (exchanging rows and columns of) a large text-based table:
loop over the lines, splitting each row and appending the chunks to the
respective column's string; finally print the column strings as new rows
when the complete table is read.

Wolfram

Reini Urban

Aug 17, 2010, 5:19:26 AM8/17/10
to Jan Dubois, Wolfram Humann, perl5-...@perl.org
2010/8/16 Jan Dubois <ja...@activestate.com>:

> On Mon, 16 Aug 2010, Reini Urban wrote:
>> 2010/8/16 Jan Dubois <ja...@activestate.com>:
>> > Could you provide some evidence for this claim?  The only way a
>> > "better malloc" can prevent this slowdown is by doing some kind
>> > of overallocation itself.  Since the algorithm in this patch
>> > is not necessarily cumulative with the overallocation by malloc()
>> > it is very well possible that the change has no effect at all
>> > on systems with a overallocating malloc().
>>
>> This is just theory. I KNOW that plain gnumalloc and freebsd realloc
>> do work fine.
>> But now that we have the mess someone can test it. I don't have such systems
>> so I cannot test it.
>
> I'm getting a bit tired of you discarding Ben's actual test based on
> some "knowledge" you imply to have that you can't even verify because
> you don't have access to those systems.  This is a technical issue, not
> a religious belief!

I just studied the technical papers which discuss different system mallocs.

>> > Because then it wouldn't be applied to other platforms, like FreeBSD.
>>
>> Uuh, nobody ever complained about freebsd realloc performance.
>> It was always the fastest on the planet and still is.
>
> I pointed out *twice* to you that Ben did actually measure it and came up
> with pretty bad numbers for the freebsd reallocator.  You will need to come
> up with some evidence why his benchmarks should be discarded.

Sorry Jan and Ben. So I seem to have misread that thread.
I'm satisfied with Wolfram's patch.

> To provide similar numbers for Linux and GNU malloc I've compiled bleadperl
> just before and after the patch in question, both with usemymalloc=y
> and usemymalloc=n.  The results are just for the "1E7 chars + 1E5 x 1E1 chars"
> benchmark, as that is the slowest of the bunch.  I've run the benchmark
> script 100 times for each Perl build and show the min/max runtimes to show
> that there is quite a bit of noise:
>
> Before, GNU malloc:  Min=38.6 Max=45.4 Avg=41.20
> Before, Perl malloc: Min= 7.9 Max=14.2 Avg=11.14
>
> After, GNU malloc:   Min= 9.7 Max=12.7 Avg=11.45
> After, Perl malloc:  Min= 9.4 Max=13.2 Avg=11.16
>
> It shows that GNU malloc on its own takes 4 times as long as GNU malloc with
> the patch.  GNU malloc with this patch matches the time used by the Perl
> malloc (usemymalloc=y), which doesn't seem to be affected by the patch.

> You seem to have Cygwin available to you.  Why don't you just test the


> patch and report on any actual problems instead of spreading bad attitude?

Because I have other problems with blead right now.
Somehow my markstack is corrupted, so I get crashes in the simplest
DynaLoader boot_ calls.

Sorry for my bad attitude. I thought there were no benchmarks for this
patch on the general platforms.
--
Reini Urban

Wolfram Humann

Aug 17, 2010, 2:18:24 PM8/17/10
to Jan Dubois, perl5-...@perl.org

I tried to look at memory consumption using Linux::Smaps but my simple
test case (in an endless loop: append 1024000 chars to a string, report
memory usage) shows way too little memory overhead, especially when my
patch or usemymalloc (which, when I looked at memory use on win32,
always results in even bigger memory footprints than my patch) is
active. I have no experience with Linux::Smaps. Is it the wrong tool for
this purpose? Any ideas what to use instead?

Command line:
perl -MLinux::Smaps -MConfig -E'$u=usemymalloc; say "$u: ",$Config{$u};
$m=Linux::Smaps->new; while(1){$s .= "#" for 1..1024000; $m->update;
printf"$_:%6d, ",$m->$_ for qw(rss shared_clean shared_dirty
private_clean private_dirty); printf"string:%6d k\n",length($s)/1024} '

First 10 results for each of the four cases:

Before, GNU alloc:
usemymalloc: n
rss: 4052, shared_clean: 1600, shared_dirty: 0, private_clean: 4, private_dirty: 2448, string: 1000 k
rss: 5056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 3448, string: 2000 k
rss: 6056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 4448, string: 3000 k
rss: 7056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 5448, string: 4000 k
rss: 8056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 6448, string: 5000 k
rss: 9056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 7448, string: 6000 k
rss: 10056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 8448, string: 7000 k
rss: 11056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 9448, string: 8000 k
rss: 12056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 10448, string: 9000 k
rss: 13056, shared_clean: 1600, shared_dirty: 0, private_clean: 8, private_dirty: 11448, string: 10000 k

Before, Perl alloc:
usemymalloc: y
rss: 3992, shared_clean: 576, shared_dirty: 0, private_clean: 1036, private_dirty: 2380, string: 1000 k
rss: 4996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 3380, string: 2000 k
rss: 5996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 4380, string: 3000 k
rss: 6996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 5380, string: 4000 k
rss: 7996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 6380, string: 5000 k
rss: 8996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 7380, string: 6000 k
rss: 9996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 8380, string: 7000 k
rss: 10996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 9380, string: 8000 k
rss: 11996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 10380, string: 9000 k
rss: 12996, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 11380, string: 10000 k

After, GNU alloc:
usemymalloc: n
rss: 4068, shared_clean: 584, shared_dirty: 0, private_clean: 1024, private_dirty: 2460, string: 1000 k
rss: 5072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 3460, string: 2000 k
rss: 6072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 4460, string: 3000 k
rss: 7072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 5460, string: 4000 k
rss: 8072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 6460, string: 5000 k
rss: 9072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 7460, string: 6000 k
rss: 10072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 8460, string: 7000 k
rss: 11072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 9460, string: 8000 k
rss: 12072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 10460, string: 9000 k
rss: 13072, shared_clean: 584, shared_dirty: 0, private_clean: 1028, private_dirty: 11460, string: 10000 k

After, Perl alloc:
usemymalloc: y
rss: 4000, shared_clean: 576, shared_dirty: 0, private_clean: 1036, private_dirty: 2388, string: 1000 k
rss: 6036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 4420, string: 2000 k
rss: 7036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 5420, string: 3000 k
rss: 8036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 6420, string: 4000 k
rss: 9036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 7420, string: 5000 k
rss: 10036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 8420, string: 6000 k
rss: 11036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 9420, string: 7000 k
rss: 12036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 10420, string: 8000 k
rss: 13036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 11420, string: 9000 k
rss: 14036, shared_clean: 576, shared_dirty: 0, private_clean: 1040, private_dirty: 12420, string: 10000 k

demerphq

Aug 20, 2010, 10:53:17 AM
to Jan Dubois, Reini Urban, Wolfram Humann, perl5-...@perl.org
On 16 August 2010 08:19, Jan Dubois <ja...@activestate.com> wrote:
> On Sun, 15 Aug 2010, Reini Urban wrote:
>> Jan Dubois schrieb:
>> > On Fri, 30 Jul 2010, Wolfram Humann wrote:
>> > The discussion of this change seemed to have stalled, but I see
>> > +1 votes from David Golden and Marvin Humphrey, with additional
>> > information from Ben Morrow that the patch also helps on FreeBSD
>> > (when building without -Dusemymalloc), with nobody voicing
>> > any disagreement about applying the patch.
>>
>> This particular slowdown was only recognized for WIN32 native malloc,
>> but not for other platforms with better malloc libs.
>
> Did you read the paragraph you quoted above?  It explicitly claims that
> the slowdown happens on other platforms when using the platform native
> malloc.
>
>> Those platforms are now hurt by using overallocation,
>> i.e. need more memory, e.g. with piping.
>
> Could you provide some evidence for this claim?  The only way a
> "better malloc" can prevent this slowdown is by doing some kind
> of overallocation itself.

This is not correct. Mallocs/reallocs that can merge blocks do not
have the performance penalty that this algorithm seeks to work around.
The problem here is that the Win32 realloc always copies, and thus
extending a block a character at a time becomes quadratic. A
realloc that merges blocks, and only copies when there is insufficient
contiguous space, does not have this problem.

Nicholas had some impressive benchmarks from this change for AIX or
HPUX (I forget which), whose realloc has the same problems as Win32's.

Cheers,
yves

demerphq

Aug 20, 2010, 11:05:41 AM
to Jan Dubois, Reini Urban, Wolfram Humann, perl5-...@perl.org

I'll just note that I'm not arguing against this patch, just that
overallocation is not the only reason that a malloc might not be
penalized by this change.

One real-world benchmark that people might want to try would be to use
a routine like this:

sub make_tree {
    my ($depth) = shift;
    return int rand 100 unless $depth > 0;
    return [ make_tree($depth-1), make_tree($depth-1) ];
}

and then use the XS implementation of Data::Dumper to dump the results
of make_tree() for various N.

On win32 even modest N will result in the machine essentially hanging.
On no other OS that I've tried it on is the slowdown as noticeable.
This was traced to the use of realloc in SV_GROW(). This was the
analysis that led to Nicholas' original patch.

cheers,
Yves

Jan Dubois

Aug 20, 2010, 1:43:34 PM
to demerphq, Reini Urban, Wolfram Humann, perl5-...@perl.org
On Fri, 20 Aug 2010, demerphq wrote:
> On 20 August 2010 16:53, demerphq <deme...@gmail.com> wrote:
>> On 16 August 2010 08:19, Jan Dubois <ja...@activestate.com> wrote:
>>> Could you provide some evidence for this claim?  The only way a
>>> "better malloc" can prevent this slowdown is by doing some kind of
>>> overallocation itself.
>>
>> This is not correct. Mallocs/reallocs that can merge blocks do not
>> have the performance penalty that this algorithm seeks to work
>> around. The problem here is that the Win32 realloc always copies, and
>> thus extending a block a character at a time becomes quadratic. A
>> realloc that merges blocks, and only copies when there is
>> insufficient contiguous space, does not have this problem.
>
> I'll just note that I'm not arguing against this patch, just that
> overallocation is not the only reason that a malloc might not be
> penalized by this change.

Indeed, "the only way" was a rather poor choice of words.

Note though that all the block merge algorithms still depend heavily
on the allocation pattern. If there are no free blocks to merge, then
they still provide poor performance, and that case *will* be helped
by the overallocation during the growth phase. This has been shown
for both FreeBSD and the Linux allocators.

> One real-world benchmark that people might want to try would be to use
> a routine like this:
>
> sub make_tree {
> my ($depth) = shift;
> return int rand 100 unless $depth>0;
> return [ make_tree($depth-1), make_tree($depth-1) ]
> }
>
> and then use the XS implementation of Data::Dumper to dump the results
> of make_tree() for various N.
>
> On win32 even modest N will result in the machine essentially hanging.
> On no other OS that I've tried it on is the slowdown as noticeable.
> This was traced to the use of realloc in SV_GROW(). This was the
> analysis that led to Nicholas' original patch.

I don't see how sv_grow() comes into this example at all. Are you talking
about av_expand()?

That is indeed an area that should probably be scrutinized for potential
Windows performance improvements too. I also have some vague memory that
hash expansion beyond something like 100,000 keys gets extremely slow on
Windows.

Cheers,
-Jan

demerphq

Aug 20, 2010, 2:59:18 PM
to Jan Dubois, Reini Urban, Wolfram Humann, perl5-...@perl.org
On 20 August 2010 19:43, Jan Dubois <ja...@activestate.com> wrote:
> On Fri, 20 Aug 2010, demerphq wrote:
>> On 20 August 2010 16:53, demerphq <deme...@gmail.com> wrote:
>>> On 16 August 2010 08:19, Jan Dubois <ja...@activestate.com> wrote:
>>>> Could you provide some evidence for this claim?  The only way a
>>>> "better malloc" can prevent this slowdown is by doing some kind of
>>>> overallocation itself.
>>>
>>> This is not correct. Mallocs/reallocs that can merge blocks do not
>>> have the performance penalty that this algorithm seeks to work
>>> around. The problem here is that the Win32 realloc always copies, and
>>> thus extending a block a character at a time becomes quadratic. A
>>> realloc that merges blocks, and only copies when there is
>>> insufficient contiguous space, does not have this problem.
>>
>> I'll just note that I'm not arguing against this patch, just that
>> overallocation is not the only reason that a malloc might not be
>> penalized by this change.
>
> Indeed, "the only way" was a rather poor choice of words.
>
> Note though that all the block merge algorithms still depend heavily
> on the allocation pattern.  If there are no free blocks to merge, then
> they still provide poor performance, and that case *will* be helped
> by the overallocation during the growth phase.  This has been shown
> for both FreeBSD and the Linux allocators.

IMO this is "just" common sense: if we call realloc/malloc less often,
the underlying code will be faster. Given that the perl tradition is
to trade memory for speed in most situations, this patch makes sense
-- and it /also/ provides a vector for us to provide a

use less 'memory';

which would control how much we preallocate.

>> One real-world benchmark that people might want to try would be to use
>> a routine like this:
>>
>> sub make_tree {
>>   my ($depth) = shift;
>>   return int rand 100 unless $depth>0;
>>   return [ make_tree($depth-1), make_tree($depth-1) ]
>> }
>>
>> and then use the XS implementation of Data::Dumper to dump the results
>> of make_tree() for various N.
>>
>> On win32 even modest N will result in the machine essentially hanging.
>> On no other OS that I've tried it on is the slowdown as noticeable.
>> This was traced to the use of realloc in SV_GROW(). This was the
>> analysis that led to Nicholas' original patch.
>
> I don't see how sv_grow() comes into this example at all.  Are you talking
> about av_expand()?

Actually no, I'm talking about sv_grow(). Don't forget I said to pass
the output of make_tree() into the XS version of Data::Dumper, which
will stringify the resulting binary tree by extending a buffer
essentially character by character (actually token by token, but this
example is pretty close to char by char) to add open brackets,
integers, and close brackets. It uses sv_grow to keep growing the
buffer as it does so. Interestingly, on Windows you can often make
Data::Dumper go faster by NOT using the XS version, by forcing it to
use the pure Perl version via $Data::Dumper::Useqq or more recent
equivalents. I'm not sure exactly why this makes a difference, I never
dug, but Nicholas was able to demonstrate a huge speedup on some
platforms with his original hack along these lines. Basically it
turned out that DD was calling realloc() pretty much once per
character emitted while dumping. By preallocating an extra 10% the
number of realloc calls was reduced dramatically.

Also, note that av_expand() is never called in my example as all
arrays are preallocated with 8 elements (er, I think - certainly more
than 2) and my code uses only 2 per array. The point here is that the
code will produce *many* arrays, and will require DD to do a lot of
concatenation to do the dump.

> That is indeed an area that should probably be scrutinized for potential
> Windows performance improvements too.  I also have some vague memory that
> hash expansion beyond something like 100,000 keys gets extremely slow on
> Windows.

Sounds interesting, but it's not the case I meant; I was strictly
referring to the way the XS version of Data::Dumper builds the
stringified version of the object being dumped.

Cheers,
yves

demerphq

Aug 20, 2010, 3:13:56 PM
to Jan Dubois, Wolfram Humann, perl5-...@perl.org

I suspect this was Nicholas's hack to reduce the number of times
realloc is called when appending single characters to a buffer over
and over.

One thing I have a bit of hesitancy about is that the benchmarks here
(emphasis on "seems", I haven't looked very closely, unfortunately)
/look/ like they involve concatenating large strings together. These
types of benchmarks will "blow" the merge-block algorithm used in some
reallocs (depending on memory pressure etc. on the system at the
time). The original thought was to address the performance of code
like this:

my $str = "";
$str .= "x" for 1..100000;

If we don't overallocate AND we are using a malloc/realloc that
doesn't do block merges (like Windows worked when I last used Windows)
you end up getting quadratic performance, as each concatenation ends
up being a realloc call, which in turn ends up doing a memcpy(). Which
means one ends up doing a ridiculously large amount of data copying.

If I recall correctly, Nicholas made a comment along the lines of "best
performance gain for least effort" of any patch he had ever written.

Jan Dubois

Aug 20, 2010, 3:43:30 PM
to demerphq, Wolfram Humann, perl5-...@perl.org
On Fri, 20 Aug 2010, demerphq wrote:
> On 14 August 2010 02:39, Jan Dubois <ja...@activestate.com> wrote:
> > Things that could benefit from further discussion are:
> >
> > a) the algorithm in sv_grow() for expanding SVs with SvOOK
> >
> >        if (newlen > SvLEN(sv))
> >            newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
> >
> >   This doesn't make much sense, and is catastrophic when
> >   you append a very long string to a short one.  I have not looked
> >   into the heritage of this algorithm yet.
>
> I suspect this was Nicholas's hack to reduce the number of times
> realloc is called when appending single characters to a buffer over
> and over.

This code is only executed when the SvOOK bit is set (which means that
some bytes at the beginning of the allocated block have been chopped
off). Are you sure that applies to the Data::Dumper::XS case (yes, I'm
too lazy^H^H^H^Hbusy to hunt through Dumper.xs right now)?



> One thing I have a bit of hesitancy about is that the benchmarks here
> (emphasis on "seems", I haven't looked very closely, unfortunately)
> /look/ like they involve concatenating large strings together. These
> types of benchmarks will "blow" the merge-block algorithm used in some
> reallocs (depending on memory pressure etc. on the system at the
> time). The original thought was to address the performance of code
> like this:
>
> my $str = "";
> $str .= "x" for 1..100000;

No, the benchmarks in this thread are all about concatenating small strings
to large buffers.

Just to reiterate it once more: the "new" patch will not kick in *at all*
if the required new size is more than (25% + 10 bytes) larger than the
old size. It only helps repeated appends of smallish increments.

> If we don't overallocate AND we are using a malloc/realloc that doesn't
> do block merges (like Windows worked when I last used Windows) you end
> up getting quadratic performance, as each concatenation ends up being a
> realloc call, which in turn ends up doing a memcpy(). Which means one
> ends up doing a ridiculously large amount of data copying.

Yes, that is exactly the problem this change is supposed to "solve".



> If I recall correctly, Nicholas made a comment along the lines of "best
> performance gain for least effort" of any patch he had ever written.

Could you hunt down the original discussion (and commit id) of Nicholas'
patch? I don't quite understand how we can still get such huge improvements
if this is supposedly already fixed.

The one patch from Nicholas that I know about is that under -Dusemymalloc
he queries the size of the new block and assigns that number to SvLEN
instead of the requested size. That prevents additional calls to sv_grow()
when the block wouldn't have to move because realloc() already overallocated.

But that patch does nothing for other allocators.

Cheers,
-Jan

demerphq

Aug 21, 2010, 5:18:56 AM
to Ben Morrow, Wolfram Humann, Reini Urban, perl5-...@perl.org, Jan Dubois

Ben, Wolfram, any chance you can try benchmarking this with and
without the new patch?

Wolfram Humann

Aug 21, 2010, 6:52:39 AM
to demerphq, Ben Morrow, Reini Urban, perl5-...@perl.org, Jan Dubois
-------- Original Message --------
Subject: Re: Patch to make string-append on win32 100 times faster
For a patched win32 perl I'll have to wait till Monday. What I can
compare now is Strawberry (usemymalloc=n) to Cygwin (usemymalloc=y).
This slightly golfed version

perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int
rand 100}; say length Dumper mt 18"
34576966

needs approx. 6 seconds to print and 8 seconds to finish on both of
these. No noticeable difference. With a touch of horror I have to report
that the printed string lengths are slightly non-deterministic from run
to run...???

Wolfram

demerphq

Aug 21, 2010, 7:46:27 AM
to Wolfram Humann, Ben Morrow, Reini Urban, perl5-...@perl.org, Jan Dubois
On 21 August 2010 12:52, Wolfram Humann <w.c.h...@arcor.de> wrote:
> For a patched win32 perl I'll have to wait till Monday. What I can compare
> now is Strawberry (usemymalloc=n) to Cygwin (usemymalloc=y). This slightly
> golfed version

heh. :-)

> perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int rand
> 100}; say length Dumper mt 18"
> 34576966
>
> needs approx. 6 seconds to print and 8 seconds to finish on both of these.

Interesting, how does that compare for you if you stick a

$Data::Dumper::Useqq=1;

before the dump?

> No noticeable difference. With a touch of horror I have to report that the
> printed string lengths are slightly non-deterministic from run to run...???

:-)

No worries there, the code does use rand() after all, so we expect to
see slightly different results from run to run (or the srand
initialization would be broken). Add a srand(1) or change the int rand
100 to a constant and it should produce repeatable results.

Anyway, thanks. I recall code like this taking unacceptably long (iow,
killed before it stopped) on Win2k with the XS version, and it taking
just-barely-acceptably long with the PP version (enabled by the Useqq
option, as the XS code doesn't know how to do double-quoting).

Cheers,
yves

demerphq

Aug 21, 2010, 8:18:35 AM
to Jan Dubois, Wolfram Humann, perl5-...@perl.org, Nicholas Clark
On 21 August 2010 14:13, demerphq <deme...@gmail.com> wrote:
> I'm pretty sure it's this one:
>
> $ git log -1  1936d2a74fbc35dd0d686
> commit 1936d2a74fbc35dd0d6862386c9988d92bb8e6e0
> Author: Nicholas Clark <ni...@ccl4.org>
> Date:   Wed Jun 1 20:46:02 2005 +0000
>
>    Round up all string length requests to malloc()/realloc() to the next
>    multiple of 4/8 bytes [sizeof(size_t)] on the assumption that malloc()
>    internally will quantise, and so we're going to use space that
>    otherwise would be wasted. Hopefully this will save realloc()ing.
>
>    p4raw-id: //depot/perl@24665
>
> But I can't find the mail about it. I do recall some of the discussion
> was on IRC, but I also remember a mail. Unfortunately (for some not
> very unfortunate definition of unfortunate :-) Nicholas improves
> performance all the time, so the keywords I keep looking for in my
> mailbox give a rather large number of hits and I've yet to find the
> right one. Perhaps he remembers?

Right after this I tried a different search criterion, and while it
didn't find the mail I vaguely remember, it does seem somewhat relevant:


Avoiding realloc

Nicholas Clark had a look at the way perl allocates memory for scalars. Up
until now the exact size has always been requested, and the allocated
size is recorded. This helps avoid unnecessary trips to the heap when
a string shrinks and then grows yet stays within the size of the
original allocation.

On the assumption that "malloc" implementations silently round up
anyway, because it simplifies the internal bookkeeping, it makes sense
to round up the requested size silently on perl's side too, because
then scalars that creep a little bit past the initially requested size
might still remain within the rounded-up allocation, thereby avoiding
a costly "realloc".

Nicholas then set about trying to determine the exact number of bytes
that are in fact allocated for all small allocation values (where
small means between 1 and 256). It turns out that a brute force "round
up to nearest multiple of 4 on a 32-bit CPU" rule wasn't particularly
effective. "malloc" sometimes allocates much more, especially on
FreeBSD. Nicholas wanted to know whether it would be worth the effort
to probe for these sizes at "Configure" time, build a lookup table,
and tweak the already monstrous "PERL_STRLEN_ROUNDUP" macro to use it
for short allocation sizes.

Nicholas simply wanted to minimise the number of calls made to
"sv_grow", without requiring any extra code or tests elsewhere in the
interpreter. Gisle was unconvinced, reasoning that it takes memory
away from "malloc" that "malloc" may be able to deploy more
effectively elsewhere at some further point in time.

Use a lookup table to save memory
http://xrl.us/jqvw

demerphq

Aug 21, 2010, 8:13:11 AM
to Jan Dubois, Wolfram Humann, perl5-...@perl.org, Nicholas Clark
On 20 August 2010 21:43, Jan Dubois <ja...@activestate.com> wrote:

I'm pretty sure it's this one:

$ git log -1 1936d2a74fbc35dd0d686
commit 1936d2a74fbc35dd0d6862386c9988d92bb8e6e0
Author: Nicholas Clark <ni...@ccl4.org>
Date: Wed Jun 1 20:46:02 2005 +0000

Round up all string length requests to malloc()/realloc() to the next
multiple of 4/8 bytes [sizeof(size_t)] on the assumption that malloc()
internally will quantise, and so we're going to use space that
otherwise would be wasted. Hopefully this will save realloc()ing.

p4raw-id: //depot/perl@24665

But I can't find the mail about it. I do recall some of the discussion
was on IRC, but I also remember a mail. Unfortunately (for some not
very unfortunate definition of unfortunate :-) Nicholas improves
performance all the time, so the keywords I keep looking for in my
mailbox give a rather large number of hits and I've yet to find the
right one. Perhaps he remembers?

cheers
Yves

Wolfram Humann

Aug 24, 2010, 1:19:11 PM
to demerphq, perl5-...@perl.org
On 21.08.2010 13:46, demerphq wrote:
> On 21 August 2010 12:52, Wolfram Humann<w.c.h...@arcor.de> wrote:
>
>
>> perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int rand
>> 100}; say length Dumper mt 18"
>> 34576966
>>
>> needs approx. 6 seconds to print and 8 seconds to finish on both of these.
>>
> Interesting, how does that compare for you if you stick a
>
> $Data::Dumper::Useqq=1;
>
> before the dump?

Some more results, wristwatch timed -> ignore minor differences:

N=20
strawberry: 27 seconds
with my patch: 29 seconds
usemymalloc: 10 seconds

With $Data::Dumper::Useqq=1 I had to reduce N to 19 to avoid
out-of-memory. Strawberryperl unpatched only.

Without $Data::Dumper::Useqq=1: 10 seconds, 462 MB
With $Data::Dumper::Useqq=1: 28 seconds, 1100 MB

>> No noticeable difference. With a touch of horror I have to report that the
>> printed string lengths are slightly non-deterministic from run to run...???
>>
> :-)
>
> No worries there, the code does use rand() after all, so we expect to
> see slightly different results from run to run (or the srand
> initialization would be broken). Add a srand(1) or change the int rand
> 100 to a constant and it should produce repeatable results.
>

O.k., so no need to be worried about the code. But maybe I need to be
worried about a mind that is obviously able to type "rand" one moment and
complain about the code running non-deterministically about 5 minutes later ;-)

I'll be on vacation soon. Won't be able to respond to news in this
thread quickly (maybe not at all) for the weeks to come.

Wolfram

demerphq

Aug 24, 2010, 1:39:47 PM
to Wolfram Humann, perl5-...@perl.org
On 24 August 2010 19:19, Wolfram Humann <w.c.h...@arcor.de> wrote:
> On 21.08.2010 13:46, demerphq wrote:
>>
>> On 21 August 2010 12:52, Wolfram Humann<w.c.h...@arcor.de>  wrote:
>>
>>>
>>> perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int
>>> rand
>>> 100}; say length Dumper mt 18"
>>> 34576966
>>>
>>> needs approx. 6 seconds to print and 8 seconds to finish on both of
>>> these.
>>>
>>
>> Interesting, how does that compare for you if you stick a
>>
>>   $Data::Dumper::Useqq=1;
>>
>> before the dump?
>
> Some more results, wristwatch timed -> ignore minor differences:
>
> N=20
> strawberry:    27 seconds
> with my patch: 29 seconds
> usemymalloc:   10 seconds
>
> With $Data::Dumper::Useqq=1 I had to reduce N to 19 to avoid out-of-memory.
> Strawberryperl unpatched only.
>
> Without $Data::Dumper::Useqq=1: 10 seconds,  462 MB
> With $Data::Dumper::Useqq=1: 28 seconds, 1100 MB

Interesting...

>>> No noticeable difference. With a touch of horror I have to report that
>>> the
>>> printed string lengths are slightly non-deterministic from run to
>>> run...???
>>>
>>
>> :-)
>>
>> No worries there, the code does use rand() after all, so we expect to
>> see slightly different results from run to run (or the srand
>> initialization would be broken). Add a srand(1) or change the int rand
>> 100 to a constant and it should produce repeatable results.
>>
>
> O.k., so no need to be worried about the code. But maybe I need to be worried
> about a mind that is obviously able to type "rand" one moment and complain
> about the code running non-deterministically about 5 minutes later ;-)

I suspect your holiday will resolve any concerns you might have in
this regard....

> I'll be on vacation soon. Won't be able to respond to news in this thread
> quickly (maybe not at all) for the weeks to come.

Have fun... And don't worry about this thread; it seems to me the point is proved...

cheers
Yves
