Patch to make string-append on win32 100 times faster

Patch to make string-append on win32 100 times faster Wolfram Humann 7/30/10 6:37 AM
[sent first without being subscribed; didn't see the message appearing in
any of the archives; subscribed; resending now; hope I don't double-post...]

My first post to p5p -- let me know if I should have gone about this
differently. This discussion started on c.l.p.m., subject "Speed of
reading some MB of data using qx(...)", see
http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2.
I'll try to summarize my current understanding of matters:

When a string grows (e.g. gets appended to), perl calls sv_grow. When
sv_grow decides that the memory currently allocated to the string is
insufficient, it calls saferealloc. Depending on whether or not perl was
compiled with 'usemymalloc', this calls either perl's internal realloc
or the operating system's. Perl requests from realloc just the amount of
memory required for the current operation. With 'usemymalloc' this is
not a problem, because it rounds up memory allocations to a certain
geometric progression anyway. When the operating system's realloc is
called, this may or may not lead to disaster, depending on how it's
implemented. On win32 it does lead to disaster: when I loop 1000 times
and each time append 1000 chars to an initial string size of 10 million,
the memory grows from 10.000e6 to 10.001e6 to 10.002e6 and so on, 1000
times, until it ends at 11.000e6.

I would like to propose that perl requests an additional amount of
memory from realloc when strings grow by a small amount. After
discussion on c.l.p.m. and with the help of ideas from Ilya Zakharevich
and Peter J. Holzer my proposal reads like this:
(I'm not a regular user of diff and using it wrong might do more harm
than good)

In sv.c the following section in Perl_sv_grow():

     if (newlen > SvLEN(sv)) {        /* need more room? */
#ifndef Perl_safesysmalloc_size
     newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif

should be expanded to:

#ifndef PERL_STRLEN_EXPAND_SHIFT
#  define PERL_STRLEN_EXPAND_SHIFT 2
#endif

     if (newlen > SvLEN(sv)) {        /* need more room? */
     size_t minlen  = SvCUR(sv);
     minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
         if (newlen < minlen) newlen = minlen;
/*        newlen += (newlen >> 1) + 10; */
#ifndef Perl_safesysmalloc_size
     newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif

I've posted a benchmark with results on c.l.p.m., so two examples may
suffice here:
a) Starting from an initial string of length 1e7 chars, append 1e6 chars
in 10000 chunks of 100 chars each.
b) Read long, multiline output from another process, simulated here by
reading a 12 MB postscript file using $ps = qx(cat postscriptfile.ps)

On current Strawberry Perl (5.10 or 5.12):
a) 2.6 seconds
b) 18 seconds

Perl 5.12 compiled on win32 with modifications as above:
a) 0.014 seconds
b) 0.2 seconds

The value of PERL_STRLEN_EXPAND_SHIFT determines the balance between
extra memory use and speed (i.e. frequency of calls to realloc). "2"
seems to me like a decent compromise.

Wolfram

Re: Patch to make string-append on win32 100 times faster David Golden 7/30/10 7:30 AM
On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann <w.c.h...@arcor.de> wrote:
> I would like to propose that perl requests an additional amount of memory
> from realloc when strings grow by a small amount.

+1 -- though I'm not clear whether this is win32-specific or whether all
operating systems should work this way.

>    if (newlen > SvLEN(sv)) {        /* need more room? */
>        size_t minlen  = SvCUR(sv);
>        minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
>        if (newlen < minlen) newlen = minlen;
[snip]

I have a few questions about the specific implementation.  Please
don't take them as suggestions that your approach is necessarily wrong
-- but I'd like to understand the logic behind certain choices.

* Why are you suggesting making the extra allocation proportional to
the length of the string?  Why not a multiple of the length of the
increment?  Why not a fixed amount?

* Why the "+10"?  Why have an extra minimum at all?  Why "10" instead
of another number?  Given your examples, for a large string the 10 is
a trivial fraction of the total, and for a small string the efficiency
either doesn't matter or the string quickly grows into a large string
anyway.

* Why do you think PERL_STRLEN_EXPAND_SHIFT should be 2?  That's a 25%
increase in the length of the string.  (In many programs, I suspect
many strings are appended to only a few times, and that seems like a
large memory price to pay.)

> On current Strawberry Perl (5.10 or 5.12):
> a) 2.6 seconds
> b) 18 seconds
>
> Perl 5.12 compiled on win32 with modifications as above:
> a) 0.014 seconds
> b) 0.2 seconds

Awesome!  What does it look like with a shift of 3 or 4?

-- David

Re: Patch to make string-append on win32 100 times faster Marvin Humphrey 7/30/10 8:34 AM
On Fri, Jul 30, 2010 at 10:30:27AM -0400, David Golden wrote:
> On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann <w.c.h...@arcor.de> wrote:
> > I would like to propose that perl requests an additional amount of memory
> > from realloc when strings grow by a small amount.
>
> +1 -- though I'm not clear whether this win32 specific or whether all
> operating systems should work this way.

IMO all should.

> -- but I'd like to understand the logic behind certain choices.

We discussed this kind of algorithm at length for Lucene and Lucy, including
exploration into what Python and Ruby do.

    Dynamic array reallocation algorithms
    http://markmail.org/thread/x427sku4wrnc3rjf

> * Why are you suggesting making the extra allocation proportional to
> the length of the string?  Why not a multiple of the length of the
> increment?  Why not a fixed amount?

IMO, the overage ought to be proportional to the space required by the final
string.

> * Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
> of another number?  Given your examples, for a large string the 10 is
> a trivial fraction of the total and for a small string, the efficiency
> either doesn't matter or the string quickly grows into a large string
> anyway.

That's similar to my analysis.  I think it's more useful to round up to a
multiple of the pointer size for small values.

> * Why do you think PERL_STRLEN_EXPAND_SHIFT should be 2?  That's a 25%
> increase in the length of the string.  (In many programs, I suspect
> many strings are appended to only a few times and that seems like
> large memory price to pay.)

We decided to go with 12.5%, similar to Python.

    http://markmail.org/message/obehcimh2ly4cd3l

    I agree, a smaller size is better. Say you start with an array which holds 800
    elements but which has been over-allocated by one eighth (+100 elements =
    900). Reallocating at 900 and again at 1000-something isn't that different
    from reallocating only once at 1000. So long as you avoid reallocating at
    every increment -- 801, 802, etc. -- you have achieved your main goal.

    Both mild and aggressive over-allocation solve that problem, but aggressive
    over-allocation also involves significant RAM costs. Where the best balance
    lies depends on how bad the reallocation performance is in relation to the
    cost of insertion. An eighth seems pretty good. Doubling seems too high. 1.5,
    dunno, seems like it could work. According to this superficially
    credible-seeming comment at stackoverflow, that's what Ruby does:

    http://stackoverflow.com/questions/1100311/what-is-the-ideal-growth-rate-for-a-dynamically-allocated-array/1100449#1100449

Incidentally, the Python algo was originally intended to address poor
reallocation performance on Windows:

    http://markmail.org/message/hk5kpmgcelvizkif

Marvin Humphrey

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 7/30/10 8:30 AM
On 30.07.2010 16:30, David Golden wrote:
> On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann<w.c.h...@arcor.de>  wrote:
>    
>> I would like to propose that perl requests an additional amount of memory
>> from realloc when strings grow by a small amount.
>>      
> +1 -- though I'm not clear whether this win32 specific or whether all
> operating systems should work this way.
>
>    
>>     if (newlen > SvLEN(sv)) {        /* need more room? */
>>         size_t minlen  = SvCUR(sv);
>>         minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
>>         if (newlen < minlen) newlen = minlen;
>>      
> [snip]
>
> I have a few questions about the specific implementation.  Please
> don't take them as suggestions that your approach is necessarily wrong
> -- but I'd like to understand the logic behind certain choices.
>
> * Why are you suggesting making the extra allocation proportional to
> the length of the string?  Why not a multiple of the length of the
> increment?  Why not a fixed amount?
>    
Take a) the 12 MB postscript file. Say it has 120k lines of 100 chars
each.
Take b) a string of 100 chars to which you append another 100 chars.
Multiply the increment length by what?
Say 1000 -> max. memory waste 100 kB. Case a): 120 reallocs, memory
waste < 1% -- o.k. Case b): 1 realloc, memory waste 99.9% -- not
acceptable.
Say 10: case b) still wastes 80% of its memory, but a) needs 12000
reallocs -- not acceptable.
Fixed amounts are even worse. Try any value and see!

Geometric progression covers small and big strings equally well. Also,
the time required for a realloc increases linearly with the amount of
memory that needs to be moved to a new location, so it makes sense to
realloc less often for big strings.

> * Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
> of another number?  Given your examples, for a large string the 10 is
> a trivial fraction of the total and for a small string, the efficiency
> either doesn't matter or the string quickly grows into a large string
> anyway.
>    
It's debatable whether it's required, but if you start with an empty
string and append to it char-by-char, it does substantially reduce the
number of system calls at the beginning. I don't know how much it
matters if you have many very small strings (how much is perl's
per-string overhead already?).

> * Why do you think PERL_STRLEN_EXPAND_SHIFT should be 2?  That's a 25%
> increase in the length of the string.  (In many programs, I suspect
> many strings are appended to only a few times and that seems like
> large memory price to pay.)
>    
I compared to Cygwin Perl on the same machine. It's compiled to use the
perl-internal malloc/realloc. It grows memory even more aggressively
than my proposal with a shift of 2: the total memory required for the
same program in Cygwin Perl is often 30% to 50% bigger than in my
modified perl. Also, if you go from a shift of 2 to 4, you reduce the
*max.* overhead from 25% to 6.25%, but you also roughly quadruple the
number of reallocs required. This can add a lot to the runtime of your
program.
Also note that the overhead only occurs if you append small amounts each
time. No overhead occurs if you increase the size of your original
string by 25% or more in a single append.

>> On current Strawberry Perl (5.10 or 5.12):
>> a) 2.6 seconds
>> b) 18 seconds
>>
>> Perl 5.12 compiled on win32 with modifications as above:
>> a) 0.014 seconds
>> b) 0.2 seconds
>>      
> Awesome!  What does it look like with a shift of 3 or 4?
>
>    
As I mentioned, I put a good amount of benchmark data in my post on
c.l.p.m. This also contains memory usage info. If you can, have a look
there. If that doesn't answer your question or you want me to crosspost
it here, let me know.

Wolfram

Re: Patch to make string-append on win32 100 times faster Ben Morrow 7/30/10 8:49 AM
Quoth xda...@gmail.com (David Golden):

> On Fri, Jul 30, 2010 at 9:37 AM, Wolfram Humann <w.c.h...@arcor.de> wrote:
> > I would like to propose that perl requests an additional amount of memory
> > from realloc when strings grow by a small amount.
>
> +1 -- though I'm not clear whether this win32 specific or whether all
> operating systems should work this way.

I did some testing when we were discussing this on clpmisc, and found
that at least FreeBSD has the same poor performance as Win32 when perl
is built without -Dusemymalloc. (The issue is much more severe on Win32,
since it's impossible to build with both pseudofork and usemymalloc.)

Ben

Re: Patch to make string-append on win32 100 times faster Reini Urban 7/30/10 9:04 AM
Marvin Humphrey schrieb:

>> * Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
>> of another number?  Given your examples, for a large string the 10 is
>> a trivial fraction of the total and for a small string, the efficiency
>> either doesn't matter or the string quickly grows into a large string
>> anyway.
>
> That's similar to my analysis.  I think it's more useful to round up to a
> multiple of the pointer size for small values.

I would rather align it to the pointer size, and keep an eye on the
pagesize.

I did similar tricks in Tie::CArray to get decent realloc
(array grow) performance.
http://cpansearch.perl.org/src/RURBAN/Tie-CArray-0.15/CArray.xs

#define PAGEBITS     11               /* we can also use 10 or 12; most systems malloc to 12, i.e. 4096 */
#define MY_PAGESIZE  (1 << PAGEBITS)  /* 2048 bytes is the size of a fresh carray */

/* This is the to-tune part:
 * The overall size should fit into a page or other malloc chunks.
 * Leave room for "some" more items, but align it to the page size.
 * Should small arrays (<100) be aligned at 2048 or smaller bounds?
 * 10 => 2048-10, 2000 => 2048-2000, 200.000 => 2048
 * len is the actual length of the array, size the itemsize in bytes.
 */
int freesize (int len, int size)
{
    len *= size;
    return max(MY_PAGESIZE - len, len - ((len >> PAGEBITS) << PAGEBITS)) / size;
}

freesize is the slack space to allocate additionally, and can be inlined.
Grow by n:

    ptr = saferealloc(ptr, len + freesize(len + n, itemsize));
--
Reini Urban
http://phpwiki.org/  http://murbreak.at/

Re: Patch to make string-append on win32 100 times faster Eric Brine 7/30/10 9:14 AM
On Fri, Jul 30, 2010 at 10:30 AM, David Golden <xda...@gmail.com> wrote:

> * Why are you suggesting making the extra allocation proportional to
> the length of the string?
>

Note that this is how arrays currently work. Push expands arrays using
the formula

new_size = old_size * 2 + 4

Re: Patch to make string-append on win32 100 times faster w.c.h...@arcor.de 7/30/10 2:26 PM
From: Marvin Humphrey <mar...@rectangular.com>
Date: 30.07.2010 17:34

>> * Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
>> of another number?  Given your examples, for a large string the 10 is
>> a trivial fraction of the total and for a small string, the efficiency
>> either doesn't matter or the string quickly grows into a large string
>> anyway.
>>    
>
> That's similar to my analysis.  I think it's more useful to round up to a
> multiple of the pointer size for small values.
>  
If I understand correctly (which I'm not certain about), this is done
anyway right afterwards in
newlen = PERL_STRLEN_ROUNDUP(newlen);

Wolfram

Re: Patch to make string-append on win32 100 times faster Jim Cromie 7/30/10 9:37 PM

what happens if you presize the variable 1st?
then assign an empty string,
then start appending the additions.

this will avoid all the sv_grow calls,
all the reallocs, and just use the one-time preallocated memory.

this avoids all the mods to the heuristics you're proposing
(modest tho they might be), and has no unintended effects
on other benchmarkable use cases.

There's some perl way to do this without explicitly assigning a large string
(as I did below), but I cannot recall it, and can't think of a suitably narrow
search term/strategy for it (hints welcome)

length($s) = 1000000;  # isn't correct

[jimc@harpo perl]$ perl -MDevel::Peek -de1

main::(-e:1):        1
  DB<1> $s='a'x10000;

  DB<2> Dump $s
SV = PV(0x8c03df0) at 0x8c8c578
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x8ce9f48
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"\0
  CUR = 10000
  LEN = 10004

  DB<3> $s='a'

  DB<4> Dump $s
SV = PV(0x8c03df0) at 0x8c8c578
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x8ce9f48 "a"\0
  CUR = 1
  LEN = 10004

Re: Patch to make string-append on win32 100 times faster Eric Brine 7/30/10 10:14 PM
On Sat, Jul 31, 2010 at 12:37 AM, Jim Cromie <jim.c...@gmail.com> wrote:

> this avoids all the mods to the heuristics youre proposing
> (modest tho they might be), and has no unintended effects
> on other benchmarkable use cases.
>

The problem is not specific to building a giant string. Strings are grown
all the time.

> There's some perl way to do this without explicitly assigning a large string

It involves open(\$var)

Re: Patch to make string-append on win32 100 times faster David Golden 7/31/10 4:44 AM
On Sat, Jul 31, 2010 at 1:14 AM, Eric Brine <ike...@adaelis.com> wrote:
> Theres some perl way to do this without explicitly assigning a large string
>>
>
> It involves open(\$var)

Sounds like a new candidate for Scalar::Util.

-- David

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 7/31/10 5:33 AM
From: Jim Cromie <jim.c...@gmail.com>
Date: 31.07.2010 06:37

> On Fri, Jul 30, 2010 at 3:26 PM,  <w.c.h...@arcor.de> wrote:
>  
>> From: Marvin Humphrey <mar...@rectangular.com>
>> Date: 30.07.2010 17:34
>>    
>>>> * Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
>>>> of another number?  Given your examples, for a large string the 10 is
>>>> a trivial fraction of the total and for a small string, the efficiency
>>>> either doesn't matter or the string quickly grows into a large string
>>>> anyway.
>>>>
>>>>        
>>> That's similar to my analysis.  I think it's more useful to round up to a
>>> multiple of the pointer size for small values.
>>>
>>>      
>> If I understand correctly (which I'm not certain about), this is done anyway
>> right afterwards in
>> newlen = PERL_STRLEN_ROUNDUP(newlen);
>>
>> Wolfram
>>
>>
>>    
>
> what happens if you presize the variable 1st ?
> then assign an empty string,
> then start appending the additions.
>
> this will avoid all the sv_grow calls,
> all the reallocs, and just use the one-time preallocated memory.
>
> this avoids all the mods to the heuristics youre proposing
> (modest tho they might be), and has no unintended effects
> on other benchmarkable use cases.
>  
There are many cases when you don't know the final size of your
string. In those cases you would need to invent a strategy similar to
what I proposed. Take my original script:

$ps_text = qx( pdftops some_pdf_file.pdf - );

The PDF file is about 2MB. I know it will grow during PDF -> Postscript
conversion. But it's hard to estimate how much it will grow. If I play
safe, I will likely waste much more memory than some well designed
perl-internal strategy.
Also, any perl compiled to use the internal malloc/realloc will "round
up to a certain geometric progression anyway" (Ilya Zakharevich). If you
look at Marvin Humphrey's post, you'll see that such a geometric growth
strategy (although with different details concerning acceptable memory
overhead etc.) is very common.

Wolfram


Re: Patch to make string-append on win32 100 times faster Nicholas Clark 7/31/10 8:04 AM

That's certainly what the expression in PERL_STRLEN_ROUNDUP() is intended to
do. (Round up to a multiple of the pointer size, so that growth of strings by
small amounts usually avoids repeated realloc() calls for an extra 1 or 2
bytes).

I'm not sure what my opinion on the main proposal is.
(I don't have time to read and digest all my mail currently)

Nicholas Clark

Re: Patch to make string-append on win32 100 times faster demerphq 8/2/10 3:26 AM
On 31 July 2010 17:04, Nicholas Clark <ni...@ccl4.org> wrote:
> On Fri, Jul 30, 2010 at 11:26:15PM +0200, w.c.h...@arcor.de wrote:
>> From: Marvin Humphrey <mar...@rectangular.com>
>> Date: 30.07.2010 17:34
>> >>* Why the "+10"?  Why have a extra minimum at all?  Why "10" instead
>> >>of another number?  Given your examples, for a large string the 10 is
>> >>a trivial fraction of the total and for a small string, the efficiency
>> >>either doesn't matter or the string quickly grows into a large string
>> >>anyway.
>> >>
>> >
>> >That's similar to my analysis.  I think it's more useful to round up to a
>> >multiple of the pointer size for small values.
>> >
>> If I understand correctly (which I'm not certain about), this is done
>> anyway right afterwards in
>> newlen = PERL_STRLEN_ROUNDUP(newlen);
>
> That's certainly what the expression in PERL_STRLEN_ROUNDUP() is intended to
> do. (Round up to a multiple of the pointer size, so that growth of strings by
> small amounts usually avoids repeated realloc() calls for an extra 1 or 2
> bytes).

When I saw this I was confused, as I remember when Nicholas wrote the
patch that added PERL_STRLEN_ROUNDUP(), and I'm wondering why this
patch doesn't modify /that/ instead of adding additional logic /in
front/, which will then be mixed together....

IOW, it seems like this patch will add some percentage, and then
PERL_STRLEN_ROUNDUP will add another percentage of /that/.

This doesn't seem right. What's the problem with PERL_STRLEN_ROUNDUP()?
And, given that PERL_STRLEN_ROUNDUP() is guarded like so:

#ifndef Perl_safesysmalloc_size
   newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif

why should this new logic not also be guarded? Surely the intent is to
have a mode where Perl doesn't mess with the requested size at all?

cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster demerphq 8/2/10 3:36 AM

Here are three ways to do it in perl. The first is interesting as it
works slightly differently in that it enables the OOK flag as well,
which IMO is a bug in substr().

Yves

$ perl -MDevel::Peek -e'$size=10; $str=" " x $size;
substr($str,0,$size,""); Dump($str); $str.="ab"; Dump($str);'
SV = PVIV(0x8ac9fd4) at 0x8ad40c8
  REFCNT = 1
  FLAGS = (POK,OOK,pPOK)
  IV = 10  (OFFSET)
  PV = 0x8acafa2 ( "          " . ) ""\0
  CUR = 0
  LEN = 2
SV = PVIV(0x8ac9fd4) at 0x8ad40c8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  IV = 0
  PV = 0x8acaf98 "ab"\0
  CUR = 2
  LEN = 12
$ perl -MDevel::Peek -e'$size=10; $str=" " x $size; $str=~s/.+//g;
Dump($str); $str.="ab"; Dump($str);'
SV = PV(0x92b1748) at 0x92d10c0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x92c7f98 ""\0
  CUR = 0
  LEN = 12
SV = PV(0x92b1748) at 0x92d10c0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x92c7f98 "ab"\0
  CUR = 2
  LEN = 12
$ perl -MDevel::Peek -e'$size=10; $str=" " x $size; chop($str) while
length($str); Dump($str); $str.="ab"; Dump($str);'
SV = PV(0x928c748) at 0x92ac0d0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x92a2f98 ""\0
  CUR = 0
  LEN = 12
SV = PV(0x928c748) at 0x92ac0d0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x92a2f98 "ab"\0
  CUR = 2
  LEN = 12


--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster H.Merijn Brand 8/2/10 4:00 AM
On Mon, 2 Aug 2010 12:36:04 +0200, demerphq <deme...@gmail.com> wrote:

> On 31 July 2010 13:44, David Golden <xda...@gmail.com> wrote:
> > On Sat, Jul 31, 2010 at 1:14 AM, Eric Brine <ike...@adaelis.com> wrote:
> >> Theres some perl way to do this without explicitly assigning a large string
> >>>
> >>
> >> It involves open(\$var)
> >
> > Sounds like a new candidate for Scalar::Util.
>
> Here are three ways to do it in perl. The first is interesting as it
> works slightly differently in that it enables the OOK flag as well,
> which IMO is a bug in substr().

Fourth way is implemented in Data::Peek

$ perl -MDP -e'DDump $_;DGrow $_, 102400; DDump $_'
SV = NULL(0x0) at 0x8301568
  REFCNT = 1
  FLAGS = ()
SV = PV(0x82d1114) at 0x8301568
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x83d7558 ""\0
  CUR = 0
  LEN = 102400

   my $LEN = DGrow ($pv, $size)
       Fastest way to preallocate space for a PV scalar. Returns the allocated
       length. If $size is smaller than the already allocated space, it will
       not shrink.

        cmpthese (-2, {
            pack => q{my $x = ""; $x = pack "x20000"; $x = "";},
            op_x => q{my $x = ""; $x = "x"  x 20000;  $x = "";},
            grow => q{my $x = ""; DGrow ($x,  20000); $x = "";},
            });

                  Rate  op_x  pack  grow
        op_x   62127/s    --  -59%  -96%
        pack  152046/s  145%    --  -91%
        grow 1622943/s 2512%  967%    --

XS code:

void
DGrow (sv, size)
    SV     *sv
    IV      size

  PROTOTYPE: $$
  PPCODE:
    if (SvROK (sv))
        sv = SvRV (sv);
    if (!SvPOK (sv))
        sv_setpvn (sv, "", 0);
    SvGROW (sv, size);
    mPUSHi (SvLEN (sv));
    /* XS DGrow */

--
H.Merijn Brand  http://tux.nl      Perl Monger  http://amsterdam.pm.org/
using 5.00307 through 5.12 and porting perl5.13.x on HP-UX 10.20, 11.00,
11.11, 11.23, and 11.31, OpenSuSE 10.3, 11.0, and 11.1, AIX 5.2 and 5.3.
http://mirrors.develooper.com/hpux/           http://www.test-smoke.org/
http://qa.perl.org      http://www.goldmark.org/jeff/stupid-disclaimers/

RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/13/10 5:39 PM
On Fri, 30 Jul 2010, Wolfram Humann wrote:

[...]

> I would like to propose that perl requests an additional amount of
> memory from realloc when strings grow by a small amount. After
> discussion on c.l.p.m. and with the help of ideas from Ilya Zakharevich
> and Peter J. Holzer my proposal reads like this:
> (I'm not a regular user of diff and using it wrong might do more harm
> than good)
>
> In sv.c the following section in Perl_sv_grow():
>
>      if (newlen > SvLEN(sv)) {        /* need more room? */
> #ifndef Perl_safesysmalloc_size
>      newlen = PERL_STRLEN_ROUNDUP(newlen);
> #endif
>
> should be expanded to:
>
> #ifndef PERL_STRLEN_EXPAND_SHIFT
> #  define PERL_STRLEN_EXPAND_SHIFT 2
> #endif
>
>      if (newlen > SvLEN(sv)) {        /* need more room? */
>      size_t minlen  = SvCUR(sv);
>      minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
>          if (newlen < minlen) newlen = minlen;
> /*        newlen += (newlen >> 1) + 10; */
> #ifndef Perl_safesysmalloc_size
>      newlen = PERL_STRLEN_ROUNDUP(newlen);
> #endif

Thank you for your patch!  I've now committed a slightly cleaned up version
to the Perl repository, and added your email address to the AUTHORS file:

http://perl5.git.perl.org/perl.git/commitdiff/f12005599f648e675af22dfef1047191e260bc48
http://perl5.git.perl.org/perl.git/commitdiff/1fd8fa480e535be8885c2c1b2e7e95c3aabbdf7f

Beyond formatting changes I:

* changed size_t to STRLEN for internal consistency
* removed the commented-out code
* moved the PERL_STRLEN_EXPAND_SHIFT definition to perl.h

The discussion of this change seemed to have stalled, but I see
+1 votes from David Golden and Marvin Humphrey, with additional
information from Ben Morrow that the patch also helps on FreeBSD
(when building without -Dusemymalloc), with nobody voicing
any disagreement about applying the patch.

I've kept the original constants (PERL_STRLEN_EXPAND_SHIFT being
2, plus the minimum addition of 10).  While the shift overallocates
by 25%, this is based on the original length of the buffer, not
the requested new length. The reported 12.5% overallocation used
by Python (for arrays, not strings) is based on the new size already,
not the old.

The advantage of this algorithm is also that we don't overallocate at
all if the buffer already grows by more than 25%, so we won't "waste"
any space when two long strings are concatenated; we overallocate only
when small strings are repeatedly appended to an already long one.

I don't care much either way about the constant 10, but the price
for it is small: on a 32-bit system a null-string SV already
occupies 28 bytes (24 + 1*4), so once it grows to 32 bytes,
the "10" will actually expand it to 36 (24 + 3*4) bytes instead,
which doesn't matter either way.

Things that could benefit from further discussion are:

a) the algorithm in sv_grow() for expanding SVs with SvOOK

        if (newlen > SvLEN(sv))
            newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */

   This doesn't make much sense, and is catastrophic when
   you append a very long string to a short one.  I have not looked
   into the heritage of this algorithm yet.

b) Should the new over-allocation also be used under -Dusemymalloc?  It
   will provide less benefit there, but also come at a lower cost,
   because the newlen can still fall inside the already-allocated bucket
   size anyway.  I don't have a strong opinion on whether it should be
   disabled in this situation or not.

Cheers,
-Jan


Re: Patch to make string-append on win32 100 times faster Reini Urban 8/15/10 8:51 AM
Jan Dubois schrieb:

This particular slowdown was only recognized for WIN32 native malloc,
but not for other platforms with better malloc libs.
Those platforms are now hurt by using overallocation,
i.e. need more memory, e.g. with piping.

Why was this patch not applied with the appropriate
#if defined(_WIN32), or whatever is used for MSVC and mingw?

> a) the algorithm in sv_grow() for expanding SVs with SvOOK
>
>         if (newlen>  SvLEN(sv))
>             newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
>     This doesn't make much sense, and is catastrophic when
>     you append a very long string to a short one.  I have not looked
>     into the heritage of this algorithm yet.
--
Reini Urban

RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/15/10 11:19 PM
On Sun, 15 Aug 2010, Reini Urban wrote:
> Jan Dubois schrieb:
> > On Fri, 30 Jul 2010, Wolfram Humann wrote:
> > The discussion of this change seemed to have stalled, but I see
> > +1 votes from David Golden and Marvin Humphrey, with additional
> > information from Ben Morrow that the patch also helps on FreeBSD
> > (when building without -Dusemymalloc), with nobody voicing
> > any disagreement about applying the patch.
>
> This particular slowdown was only recognized for WIN32 native malloc,
> but not for other platforms with better malloc libs.

Did you read the paragraph you quoted above?  It explicitly claims that
the slowdown happens on other platforms when using the platform native
malloc.

> Those platforms are now hurt by using overallocation,
> i.e. need more memory, e.g. with piping.

Could you provide some evidence for this claim?  The only way a
"better malloc" can prevent this slowdown is by doing some kind
of overallocation itself.  Since the algorithm in this patch
is not necessarily cumulative with the overallocation by malloc(),
it is quite possible that the change has no effect at all
on systems with an overallocating malloc().

For example, assume Perl is appending 100 bytes to a 1000 bytes string.
The patch under discussion will make sure that sv_grow() will request
1260 bytes (1000 + 1000>>2 + 10) instead of just 1100 bytes from realloc().

Assume the original 1000 bytes were allocated in a 1024 bytes slab in
the allocator, so the 1100 bytes wouldn't fit in anymore, and realloc()
will now move this to e.g. a 1536 byte slab.  In that case it doesn't
make any difference that we now asked for 1260 bytes instead of 1100.

Also note that the minimum growth is based on the old buffer size,
so appending 300 bytes to the 1000 byte string will only request
1300 bytes, because 1300 is already larger than the 1260 minimum
realloc growth.

So until proven otherwise I doubt that this patch has any noticeable
effect on a "good malloc()".  On a "medium malloc()" I would expect
it to improve performance somewhat, at a moderate additional memory
requirement.

> Why was this patch not applied with the appropriate
> #if defined(_WIN32), or whatever is used for MSVC and mingw?

Because then it wouldn't be applied to other platforms, like FreeBSD.

Note though that I added this remark for discussion about disabling it
under -Dusemymalloc:

| b) Should the new over-allocation also be used under -Dusemymalloc? It
|    will provide less benefit there, but also come at a lower cost, because
|    the newlen can still fall inside the already allocated bucket size
|    anyway.  I don't have a strong opinion on whether it should be disabled
|    in this situation.

But as I stated above, I don't think it will make much of a difference either
way for the -Dusemymalloc case; I would love to see some comprehensive
benchmarks.

Cheers,
-Jan


Re: Patch to make string-append on win32 100 times faster Reini Urban 8/16/10 4:18 AM
2010/8/16 Jan Dubois <ja...@activestate.com>:

> On Sun, 15 Aug 2010, Reini Urban wrote:
>> Jan Dubois schrieb:
>> > On Fri, 30 Jul 2010, Wolfram Humann wrote:
>> > The discussion of this change seemed to have stalled, but I see
>> > +1 votes from David Golden and Marvin Humphrey, with additional
>> > information from Ben Morrow that the patch also helps on FreeBSD
>> > (when building without -Dusemymalloc), with nobody voicing
>> > any disagreement about applying the patch.
>>
>> This particular slowdown was only recognized for WIN32 native malloc,
>> but not for other platforms with better malloc libs.
>
> Did you read the paragraph you quoted above?  It explicitly claims that
> the slowdown happens on other platforms when using the platform native
> malloc.
>
>> Those platforms are now hurt by using overallocation,
>> i.e. need more memory, e.g. with piping.
>
> Could you provide some evidence for this claim?  The only way a
> "better malloc" can prevent this slowdown is by doing some kind
> of overallocation itself.  Since the algorithm in this patch
> is not necessarily cumulative with the overallocation by malloc(),
> it is quite possible that the change has no effect at all
> on systems with an overallocating malloc().

This is just theory. I KNOW that plain gnumalloc and freebsd realloc
do work fine.
But now that we have the mess someone can test it. I don't have such systems
so I cannot test it.

> For example, assume Perl is appending 100 bytes to a 1000 bytes string.
> The patch under discussion will make sure that sv_grow() will request
> 1260 bytes (1000 + 1000>>2 + 10) instead of just 1100 bytes from realloc().
>
> Assume the original 1000 bytes were allocated in a 1024 bytes slab in
> the allocator, so the 1100 bytes wouldn't fit in anymore, and realloc()
> will now move this to e.g. a 1536 byte slab.  In that case it doesn't
> make any difference that we now asked for 1260 bytes instead of 1100.
>
> Also note that the minimum growth is based on the old buffer size,
> so appending 300 bytes to the 1000 byte string will only request
> 1300 bytes, because 1300 is already larger than the 1260 minimum
> realloc growth.
>
> So until proven otherwise I doubt that this patch has any noticeable
> effect on a "good malloc()".  On a "medium malloc()" I would expect
> it to improve performance somewhat, at a moderate additional memory
> requirement.
>
>> Why was this patch not applied with the appropriate
>> #if defined(_WIN32), or whatever is used for MSVC and mingw?
>
> Because then it wouldn't be applied to other platforms, like FreeBSD.

Uuh, nobody ever complained about freebsd realloc performance.
It was always the fastest on the planet and still is.

> Note though that I added this remark for discussion about disabling it
> under -Dusemymalloc:
>
> | b) Should the new over-allocation also be used under -Dusemymalloc? It
> |    will provide less benefit there, but also come at a lower cost, because
> |    the newlen can still fall inside the already allocated bucket size
> |    anyway.  I don't have a strong opinion on whether it should be disabled
> |    in this situation.
>
> But as I stated above, I don't think it will make much of a difference either
> way for the -Dusemymalloc case; I would love to see some comprehensive
> benchmarks.

At http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2?pli=1
there were a lot of benchmarks and profiling data.
He is testing string sizes of 1e5-1e7 bytes, not just pagesizes, piping
a typical pdf to perl.

Plain freebsd has had the best realloc performance for years (first with
phkmalloc; now they have switched to jemalloc for better multi-core
performance), always faster than simple gnumalloc (ptmalloc 2 or 3,
based on Doug Lea's public domain malloc), but "gnumalloc" was fast
enough: 12ms against 16sec.

Without this patch.

So I don't see any reason to "fix" gnumalloc or freebsd realloc
default behaviour unless someone reports problems there. Tuning realloc
is a black art (I did it for Tie::CArray once) and I don't want anyone
to touch this without any tests.

Only msvcrt is affected and so only platforms which use msvcrt realloc should
be patched, esp. without any tests on the other platforms.

FYI: cygwin (newlib) normally uses freebsd-derived libc
implementations, but in this case cygwin does not use phkmalloc,
just ptmalloc 2.6.4, i.e. plain gnumalloc
(which has no public independent_comalloc(), which is a shame btw).

-Dusemymalloc is overallocating a lot, but we know that. This is not a
realloc problem per se,
plain malloc does the same.
--
Reini Urban

Re: Patch to make string-append on win32 100 times faster David Golden 8/16/10 6:37 AM
I think I'm with Reini on this one. I'd rather keep it Windows-only until a
performance problem is documented elsewhere.

It's easy to enable for other platforms when needed.

I'm open to people doing the research before 5.14, of course. :-)

David

Re: Patch to make string-append on win32 100 times faster Ben Morrow 8/16/10 10:19 AM
Quoth rur...@x-ray.at (Reini Urban):
> 2010/8/16 Jan Dubois <ja...@activestate.com>:

> >
> > Could you provide some evidence for this claim?  The only way a
> > "better malloc" can prevent this slowdown is by doing some kind
> > of overallocation itself.  Since the algorithm in this patch
> > is not necessarily cumulative with the overallocation by malloc(),
> > it is quite possible that the change has no effect at all
> > on systems with an overallocating malloc().
>
> This is just theory. I KNOW that plain gnumalloc and freebsd realloc
> do work fine.
> But now that we have the mess someone can test it. I don't have such systems
> so I cannot test it.

The test I referred to running under FreeBSD is at
http://groups.google.com/group/comp.lang.perl.misc/msg/9e373f4d1e755456
, and the benchmark in question is at
http://groups.google.com/group/comp.lang.perl.misc/msg/1e3c56e7c7360de3
. The difference between -Dusemymalloc and not is *huge*, for instance

    5.8.8, system malloc
        1E7 chars + 1E4 x 1E2 chars: 7159.0 ms
    5.8.8, mymalloc
        1E7 chars + 1E4 x 1E2 chars:   45.3 ms

Now, it's possible that benchmark is faulty, or that perl's allocation
strategy has changed significantly since 5.8.8, but you would need to
give some reason for believing that before blindly asserting that
FreeBSD's malloc is the fastest thing in the world and nothing could
possibly improve its performance.

> Uuh, nobody ever complained about freebsd realloc performance.
> It was always the fastest on the planet and still is.

Not for perl's usage patterns, apparently. This is borne out by the fact
that ports perl has been built with -Dusemymalloc forever.

> > Note though that I added this remark for discussion about disabling it
> > under -Dusemymalloc:
> >
> > | b) Should the new over-allocation also be used under -Dusemymalloc? It
> > |    will provide less benefit there, but also come at a lower cost, because
> > |    the newlen can still fall inside the already allocated bucket size
> > |    anyway.  I don't have a strong opinion on whether it should be disabled
> > |    in this situation.
> >
> > But as I stated above, I don't think it will make much of a difference either
> > way for the -Dusemymalloc case; I would love to see some comprehensive
> > benchmarks.
>
> At
> http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2?pli=1
> were a lot of benchmarks and profiling data.
> He is testing string sizes of 1e5-1e7 byte not just pagesizes, piping
> a typical pdf to perl.

Are you saying this is a bad benchmark, for some reason? If you can
supply a better one I'd be happy to run it for you.

Ben

RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/16/10 12:17 PM
On Mon, 16 Aug 2010, Reini Urban wrote:
> 2010/8/16 Jan Dubois <ja...@activestate.com>:

> > Could you provide some evidence for this claim?  The only way a
> > "better malloc" can prevent this slowdown is by doing some kind
> > of overallocation itself.  Since the algorithm in this patch
> > is not necessarily cumulative with the overallocation by malloc(),
> > it is quite possible that the change has no effect at all
> > on systems with an overallocating malloc().
>
> This is just theory. I KNOW that plain gnumalloc and freebsd realloc
> do work fine.
> But now that we have the mess someone can test it. I don't have such systems
> so I cannot test it.

I'm getting a bit tired of you discarding Ben's actual test based on
some "knowledge" you claim to have but can't even verify because
you don't have access to those systems.  This is a technical issue, not
a religious belief!

[...]

> > Because then it wouldn't be applied to other platforms, like FreeBSD.
>
> Uuh, nobody ever complained about freebsd realloc performance.
> It was always the fastest on the planet and still is.

I pointed out *twice* to you that Ben did actually measure it and came up
with pretty bad numbers for the freebsd reallocator.  You will need to come
up with some evidence why his benchmarks should be discarded.

To provide similar numbers for Linux and GNU malloc I've compiled bleadperl
just before and after the patch in question, both with usemymalloc=y
and usemymalloc=n.  The results are just for the "1E7 chars + 1E5 x 1E1 chars"
benchmark, as that is the slowest of the bunch.  I've run the benchmark
script 100 times for each Perl build and report the min/max runtimes to show
that there is quite a bit of noise:

Before, GNU malloc:  Min=38.6 Max=45.4 Avg=41.20
Before, Perl malloc: Min= 7.9 Max=14.2 Avg=11.14

After, GNU malloc:   Min= 9.7 Max=12.7 Avg=11.45
After, Perl malloc:  Min= 9.4 Max=13.2 Avg=11.16

It shows that GNU malloc on its own takes 4 times as long as GNU malloc with
the patch.  GNU malloc with this patch matches the time used by the Perl
malloc (usemymalloc=y), which doesn't seem to be affected by the patch.

> > Note though that I added this remark for discussion about disabling it
> > under -Dusemymalloc:
> >
> > | b) Should the new over-allocation also be used under -Dusemymalloc? It
> > |    will provide less benefit there, but also come at a lower cost, because
> > |    the newlen can still fall inside the already allocated bucket size
> > |    anyway.  I don't have a strong opinion on whether it should be disabled
> > |    in this situation.
> >
> > But as I stated above, I don't think it will make much of a difference either
> > way for the -Dusemymalloc case; I would love to see some comprehensive
> > benchmarks.
>
> At http://groups.google.com/group/comp.lang.perl.misc/browse_thread/thread/b7c9133ff20009f2?pli=1
> were a lot of benchmarks and profiling data.
> He is testing string sizes of 1e5-1e7 byte not just pagesizes, piping
> a typical pdf to perl.

That is exactly the script I have been running in my tests above.

> Plain freebsd has had the best realloc performance for years (first with
> phkmalloc; now they have switched to jemalloc for better multi-core
> performance), always faster than simple gnumalloc (ptmalloc 2 or 3,
> based on Doug Lea's public domain malloc), but "gnumalloc" was fast
> enough: 12ms against 16sec.
>
> Without this patch.

See above for a comparison between GNU malloc and Perl malloc.  Note that
the Google thread you reference *also* shows that Perl malloc performs
3 times better than GNU malloc at this particular benchmark.

> So I don't see any reason to "fix" gnumalloc or freebsd realloc
> default behaviour, unless
> someone reports problems there. Tuning realloc is black art ( I did it
> for Tie::CArray once )
> and don't want someone to touch this without any tests.

Nobody is tuning realloc(), this is all about improving the performance of
sv_grow().

> Only msvcrt is affected and so only platforms which use msvcrt realloc should
> be patched, esp. without any tests on the other platforms.

This is really getting old now...

> FYI: cygwin (newlib) uses normally freebsd derived libc
> implementations, but in this case
> cygwin uses not phkmalloc, just ptmalloc2.6.4 i.e. plain gnumalloc.
> - which has no public independent_comalloc() which is a shame btw.
>
> -Dusemymalloc is overallocating a lot, but we know that. This is not a
> realloc problem per se, plain malloc does the same.

You seem to have Cygwin available to you.  Why don't you just test the
patch and report on any actual problems instead of spreading bad attitude?

Cheers,
-Jan

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 8/16/10 3:39 PM
On 16.08.2010 21:17, Jan Dubois wrote:
> To provide similar numbers for Linux and GNU malloc I've compiled bleadperl
> just before and after the patch in question, both with usemymalloc=y
> and usemymalloc=n.  The results are just for the "1E7 chars + 1E5 x 1E1 chars"
> benchmark, as that is the slowest of the bunch.  I've run the benchmark
> script 100 times for each Perl build and report the min/max runtimes to show
> that there is quite a bit of noise:
>
> Before, GNU malloc:  Min=38.6 Max=45.4 Avg=41.20
> Before, Perl malloc: Min= 7.9 Max=14.2 Avg=11.14
>
> After, GNU malloc:   Min= 9.7 Max=12.7 Avg=11.45
> After, Perl malloc:  Min= 9.4 Max=13.2 Avg=11.16
>
> It shows that GNU malloc on its own takes 4 times as long as GNU
> malloc with
> the patch.  GNU malloc with this patch matches the time used by the Perl
> malloc (usemymalloc=y), which doesn't seem to be affected by the patch.

Hm, the differences I get on my Linux go in the same direction but are
less impressive. On an unpatched perl I get around 12 ms with
usemymalloc=n and around 8 ms with usemymalloc=y.

But I also tried to follow up on Peter J. Holzer's remarks in
<http://groups.google.com/group/comp.lang.perl.misc/msg/8747b64794aec27b>.
He used strace for a closer look at what GNU malloc does and found that
it leaves free space after allocated memory so that it can often realloc
without moving memory. This free space (as far as I understand from
watching private memory usage reported by Linux::Smaps) is not
considered part of perl's memory until actually used. He also suggests
that this algorithm -- like most "clever" ones -- can be broken. Looks
like it can:
(All on an unpatched perl 5.10.1 with usemymalloc=n and usemymalloc=y.
Starting with an empty my @ar2;. Code given is an additional line in my
array of tests, but could also run stand-alone.)

1) 'append $i' => sub{ for my $i (1..1E3){ $ar2[$i] .= $c1E2 for 1..1E3 } }
That's the 'good' case: Append 1000 times to one string, move to the
next string, append 1000 times to that one, move to the next...
GNU malloc:    238.8 ms
Perl malloc: 164.8 ms

2) 'append $_' => sub{ for my $i (1..1E3){ $ar2[$_] .= $c1E2 for 1..1E3 } }
That's the 'bad' case: append to each string once, then again to each
string, ...
GNU malloc:  57143.0 ms
Perl malloc: 298.0 ms

Jan, if possible, could you run that on your before-and-after-the-patch
compiles?

Wolfram

RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/16/10 3:47 PM
On Mon, 16 Aug 2010, Wolfram Humann wrote:
> (All on an unpatched perl 5.10.1 with usemymalloc=n and usemymalloc=y.
> Starting with an empty my @ar2;. Code given is an additional line in my
> array of tests, but could also run stand-alone)
>
> 1) 'append $i' => sub{ for my $i (1..1E3){ $ar2[$i] .= $c1E2 for 1..1E3 } }
> That's the 'good' case: Append 1000 times to one string, move to the
> next string, append 1000 times to that one, move to the next...
> GNU malloc:    238.8 ms
> Perl malloc: 164.8 ms
>
> 2) 'append $_' => sub{ for my $i (1..1E3){ $ar2[$_] .= $c1E2 for 1..1E3 } }
> That's the 'bad' case: append to each string once, then again to each
> string, ...
> GNU malloc:  57143.0 ms
> Perl malloc: 298.0 ms
>
> Jan, if possible, could you run that on your before-and-after-the-patch
> compiles?

Before, GNU alloc:
        append $i:   220.4 ms
        append $_: 12879.6 ms

Before, Perl alloc:
        append $i:  200.4 ms
        append $_:  297.6 ms

After, GNU alloc:
        append $i:  190.9 ms
        append $_:  282.8 ms

After, Perl alloc:
        append $i:  198.2 ms
        append $_:  269.7 ms

So I guess as expected: GNU malloc improves a lot, Perl malloc is unaffected.

I haven't looked at total memory consumption for these tests, so I can't
confirm that the memory usage for Perl malloc() is unchanged.  I strongly
suspect that to be the case though.

Cheers,
-Jan


RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/16/10 4:09 PM
On Mon, 16 Aug 2010, David Golden wrote:
>
> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
> performance problem is documented elsewhere.

It's been documented that the patch improves performance on FreeBSD by a
*factor* of 160 for one particular test case, as well as a factor of 4
on Linux. In the meantime Wolfram has constructed another testcase where
the patch improves performance on Linux by a factor of 45 (my system) to
230 (Wolfram's system).

Granted, these are worst-case samples, but at least the first test
program was derived from a real-world application: appending lots of
small strings to an ever-growing large buffer.

The performance improvements have only been observed with
usemymalloc=n, bringing that case up to par with the usemymalloc=y case
performance-wise.

The usemymalloc=y case seems (as expected) unaffected by this change. So
disabling the code for that case won't make a difference. I generally
prefer to keep code as simple as possible though, so I'm not sure if the
patch needs to be disabled for usemymalloc=y. This is more a
philosophical than a technical issue.

> It's easy to enable for other platforms when needed.

So far the patch has improved performance for *every* platform malloc() it
has been tried on.  The only malloc() that isn't improved by it is the
Perl malloc from Ilya.
 
> I'm open to people doing the research before 5.14, of course. :-)

Given the way the algorithm works, and that it has been shown to improve
performance on Windows, Linux and FreeBSD, I claim it is now time for
the detractors to find a system where the change has any negative
consequences.

Cheers,
-Jan

Re: Patch to make string-append on win32 100 times faster Ben Morrow 8/16/10 4:25 PM
Quoth w.c.h...@arcor.de (Wolfram Humann):
>
> But I also tried to follow up on Peter J. Holzer's remarks in
> <http://groups.google.com/group/comp.lang.perl.misc/msg/8747b64794aec27b>.
> He used strace for a closer look at what GNU malloc does and found that
> it leaves free space after allocated memory so that it can often realloc
> without moving  memory.

Does glibc have a malloc_size() function (or equivalent), and does perl
use it? You can check with perl -V:'d_malloc.*_size'. If there is such a
function but perl isn't finding it, it may be worth fixing that.

FreeBSD appears to have an equivalent function called malloc_usable_size,
defined in <malloc_np.h>, which perl doesn't know about. When I get a
chance I'll try hacking it in to see if it makes a difference (I don't
think I've a hope of doing a proper Configure patch for it :().

> This free space (as far as I understand from
> watching private memory usage reported by Linux::Smaps) is not
> considered part of perl's memory until actually used.

This is usual. Mapped memory that hasn't been touched yet isn't actually
allocated anywhere by the kernel. (Linux is more aggressive about
allowing over-allocation from userland than some other OSen.)

Ben

Re: Patch to make string-append on win32 100 times faster David Golden 8/16/10 5:20 PM
On Mon, Aug 16, 2010 at 7:09 PM, Jan Dubois <ja...@activestate.com> wrote:
> On Mon, 16 Aug 2010, David Golden wrote:
>>
>> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
>> performance problem is documented elsewhere.
>
> It's been documented that the patch improves performance on FreeBSD by a
> *factor* of 160 for one particular test case, as well as a factor of 4
> on Linux. In the meantime Wolfram has constructed another testcase where
> the patch improves performance on Linux by a factor of 45 (my system) to
> 230 (Wolfram's system).

+1

-- David

Re: Patch to make string-append on win32 100 times faster Zefram 8/16/10 10:45 PM
Ben Morrow wrote:
>Does glibc have a malloc_size() function (or equivalent),

No.

-zefram

Re: Patch to make string-append on win32 100 times faster Elizabeth Mattijsen 8/17/10 1:15 AM
On Aug 17, 2010, at 10:09 AM, Wolfram Humann wrote:

> -------- Original Message --------
> Subject: Re: Patch to make string-append on win32 100 times faster
> From: Jan Dubois <ja...@activestate.com>
> To: 'David Golden' <xda...@gmail.com>, 'Reini Urban' <rur...@x-ray.at>
> Date: 17.08.2010 01:09
>> On Mon, 16 Aug 2010, David Golden wrote:
>>  
>>> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
>>> performance problem is documented elsewhere.
>>>    
>>
>> It's been documented that the patch improves performance on FreeBSD by a
>> *factor* of 160 for one particular test case, as well as a factor of 4
>> on Linux. In the meantime Wolfram has constructed another testcase where
>> the patch improves performance on Linux by a factor of 45 (my system) to
>> 230 (Wolfram's system).
>>
>> Granted, these are worst-case samples, but at least the first test
>> program was derived from a real-world application: appending lots of
>> small strings to an ever-growing large buffer.
> I think that even my worst-case Linux test case is not totally out-of-the-world. It occurs whenever a large number of strings grows simultaneously by small increments. One application that comes to mind is pivoting (exchanging rows and columns of) a large text-based table: loop over the lines, splitting each row and appending the chunks to the respective column's string; finally print the column strings as new rows when the complete table is read.

FWIW, I have several pieces of important business logic that add packed integers to hash values for millions of iterations.  Seems like a worst case to me.

Having said that, I've been moving these cases to use a temporary file, basically using the file system as a hash.  This, in the end, proved to work a lot faster and scaled much better, way beyond the physical RAM that was available.

Liz

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 8/17/10 1:09 AM
-------- Original Message --------
Subject: Re: Patch to make string-append on win32 100 times faster
From: Jan Dubois <ja...@activestate.com>
To: 'David Golden' <xda...@gmail.com>, 'Reini Urban' <rur...@x-ray.at>
Date: 17.08.2010 01:09
> On Mon, 16 Aug 2010, David Golden wrote:
>  
>> I think I'm with Reini on this one. I'd rather keep it Windows-only until a
>> performance problem is documented elsewhere.
>>    
>
> It's been documented that the patch improves performance on FreeBSD by a
> *factor* of 160 for one particular test case, as well as a factor of 4
> on Linux. In the meantime Wolfram has constructed another testcase where
> the patch improves performance on Linux by a factor of 45 (my system) to
> 230 (Wolfram's system).
>
> Granted, these are worst-case samples, but at least the first test
> program was derived from a real-world application: appending lots of
> small strings to an ever-growing large buffer.
I think that even my worst-case Linux test case is not totally
out-of-the-world. It occurs whenever a large number of strings grows
simultaneously by small increments. One application that comes to mind
is pivoting (exchanging rows and columns of) a large text-based table:
loop over the lines, splitting each row and appending the chunks to the
respective column's string; finally print the column strings as new rows
when the complete table is read.

Wolfram

Re: Patch to make string-append on win32 100 times faster Reini Urban 8/17/10 2:19 AM
2010/8/16 Jan Dubois <ja...@activestate.com>:
> On Mon, 16 Aug 2010, Reini Urban wrote:
>> 2010/8/16 Jan Dubois <ja...@activestate.com>:
>> > Could you provide some evidence for this claim?  The only way a
>> > "better malloc" can prevent this slowdown is by doing some kind
>> > of overallocation itself.  Since the algorithm in this patch
>> > is not necessarily cumulative with the overallocation by malloc(),
>> > it is quite possible that the change has no effect at all
>> > on systems with an overallocating malloc().
>>
>> This is just theory. I KNOW that plain gnumalloc and freebsd realloc
>> do work fine.
>> But now that we have the mess someone can test it. I don't have such systems
>> so I cannot test it.
>
> I'm getting a bit tired of you discarding Ben's actual test based on
> some "knowledge" you imply to have that you can't even verify because
> you don't have access to those systems.  This is a technical issue, not
> a religious belief!

I just studied the technical papers which discuss different system mallocs.

>> > Because then it wouldn't be applied to other platforms, like FreeBSD.
>>
>> Uuh, nobody ever complained about freebsd realloc performance.
>> It was always the fastest on the planet and still is.
>
> I pointed out *twice* to you that Ben did actually measure it and came up
> with pretty bad numbers for the freebsd reallocator.  You will need to come
> up with some evidence why his benchmarks should be discarded.

Sorry Jan and Ben. So I seem to have misread that thread.
I'm satisfied with Wolfram's patch.

> To provide similar numbers for Linux and GNU malloc I've compiled bleadperl
> just before and after the patch in question, both with usemymalloc=y
> and usemymalloc=n.  The results are just for the "1E7 chars + 1E5 x 1E1 chars"
> benchmark, as that is the slowest of the bunch.  I've run the benchmark
> script 100 times for each Perl build and report the min/max runtimes to show
> that there is quite a bit of noise:
>
> Before, GNU malloc:  Min=38.6 Max=45.4 Avg=41.20
> Before, Perl malloc: Min= 7.9 Max=14.2 Avg=11.14
>
> After, GNU malloc:   Min= 9.7 Max=12.7 Avg=11.45
> After, Perl malloc:  Min= 9.4 Max=13.2 Avg=11.16
>
> It shows that GNU malloc on its own takes 4 times as long as GNU malloc with
> the patch.  GNU malloc with this patch matches the time used by the Perl
> malloc (usemymalloc=y), which doesn't seem to be affected by the patch.

> You seem to have Cygwin available to you.  Why don't you just test the
> patch and report on any actual problems instead of spreading bad attitude?

Because I have other problems with blead right now.
Somehow my markstack is corrupted, so I get crashes in the simplest
DynaLoader boot_ calls.

Sorry for my bad attitude. I thought there were no benchmarks for this
patch on the general platforms.
--
Reini Urban

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 8/17/10 11:18 AM

I tried to look at memory consumption using Linux::Smaps, but my simple
test case (in an endless loop: append 1024000 chars to a string, report
memory usage) shows way too little memory overhead, especially when my
patch or usemymalloc (which, when I looked at memory use on win32,
always resulted in even bigger memory footprints than my patch) is
active. I have no experience with Linux::Smaps. Is it the wrong tool for
this purpose? Any ideas what to use instead?

Command line:
perl -MLinux::Smaps -MConfig -E'$u=usemymalloc; say "$u: ",$Config{$u};
$m=Linux::Smaps->new; while(1){$s .= "#" for 1..1024000; $m->update;
printf"$_:%6d, ",$m->$_ for qw(rss shared_clean shared_dirty
private_clean private_dirty); printf"string:%6d k\n",length($s)/1024} '

First 10 results for each of the four cases:

Before, GNU alloc:
usemymalloc: n
rss:  4052, shared_clean:  1600, shared_dirty:     0, private_clean:     4, private_dirty:  2448, string:  1000 k
rss:  5056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  3448, string:  2000 k
rss:  6056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  4448, string:  3000 k
rss:  7056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  5448, string:  4000 k
rss:  8056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  6448, string:  5000 k
rss:  9056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  7448, string:  6000 k
rss: 10056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  8448, string:  7000 k
rss: 11056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty:  9448, string:  8000 k
rss: 12056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty: 10448, string:  9000 k
rss: 13056, shared_clean:  1600, shared_dirty:     0, private_clean:     8, private_dirty: 11448, string: 10000 k

Before, Perl alloc:
usemymalloc: y
rss:  3992, shared_clean:   576, shared_dirty:     0, private_clean:  1036, private_dirty:  2380, string:  1000 k
rss:  4996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  3380, string:  2000 k
rss:  5996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  4380, string:  3000 k
rss:  6996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  5380, string:  4000 k
rss:  7996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  6380, string:  5000 k
rss:  8996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  7380, string:  6000 k
rss:  9996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  8380, string:  7000 k
rss: 10996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  9380, string:  8000 k
rss: 11996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty: 10380, string:  9000 k
rss: 12996, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty: 11380, string: 10000 k

After, GNU alloc:
usemymalloc: n
rss:  4068, shared_clean:   584, shared_dirty:     0, private_clean:  1024, private_dirty:  2460, string:  1000 k
rss:  5072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  3460, string:  2000 k
rss:  6072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  4460, string:  3000 k
rss:  7072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  5460, string:  4000 k
rss:  8072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  6460, string:  5000 k
rss:  9072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  7460, string:  6000 k
rss: 10072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  8460, string:  7000 k
rss: 11072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty:  9460, string:  8000 k
rss: 12072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty: 10460, string:  9000 k
rss: 13072, shared_clean:   584, shared_dirty:     0, private_clean:  1028, private_dirty: 11460, string: 10000 k

After, Perl alloc:
usemymalloc: y
rss:  4000, shared_clean:   576, shared_dirty:     0, private_clean:  1036, private_dirty:  2388, string:  1000 k
rss:  6036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  4420, string:  2000 k
rss:  7036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  5420, string:  3000 k
rss:  8036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  6420, string:  4000 k
rss:  9036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  7420, string:  5000 k
rss: 10036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  8420, string:  6000 k
rss: 11036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty:  9420, string:  7000 k
rss: 12036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty: 10420, string:  8000 k
rss: 13036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty: 11420, string:  9000 k
rss: 14036, shared_clean:   576, shared_dirty:     0, private_clean:  1040, private_dirty: 12420, string: 10000 k

Re: Patch to make string-append on win32 100 times faster demerphq 8/20/10 7:53 AM
On 16 August 2010 08:19, Jan Dubois <ja...@activestate.com> wrote:
> On Sun, 15 Aug 2010, Reini Urban wrote:
>> Jan Dubois schrieb:
>> > On Fri, 30 Jul 2010, Wolfram Humann wrote:
>> > The discussion of this change seemed to have stalled, but I see
>> > +1 votes from David Golden and Marvin Humphrey, with additional
>> > information from Ben Morrow that the patch also helps on FreeBSD
>> > (when building without -Dusemymalloc), with nobody voicing
>> > any disagreement about applying the patch.
>>
>> This particular slowdown was only recognized for WIN32 native malloc,
>> but not for other platforms with better malloc libs.
>
> Did you read the paragraph you quoted above?  It explicitly claims that
> the slowdown happens on other platforms when using the platform native
> malloc.
>
>> Those platforms are now hurt by using overallocation,
>> i.e. need more memory, e.g. with piping.

>
> Could you provide some evidence for this claim?  The only way a
> "better malloc" can prevent this slowdown is by doing some kind
> of overallocation itself.

This is not correct. Mallocs/reallocs that can merge blocks do not
have the performance penalty that this algorithm seeks to work around.
The problem here is that the Win32 realloc always copies, so extending
a block a character at a time becomes quadratic. A realloc that merges
blocks, and only copies when there are not enough contiguous free
blocks, does not have this problem.
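To make the cost difference concrete, here is a small model (my own sketch, not perl source; the function names are invented) that counts the bytes a copying realloc would move while appending one character at a time, comparing exact-size requests against the proposed 25% + 10 overallocation:

```c
#include <stddef.h>

/* Model: total bytes memcpy'd while appending `appends` chunks of
 * `chunk` bytes, if every realloc that outgrows the buffer copies the
 * whole string (the Win32 behaviour described above). */
static size_t copied_exact(size_t appends, size_t chunk) {
    size_t len = 0, copied = 0;
    for (size_t i = 0; i < appends; i++) {
        copied += len;          /* realloc copies the current contents */
        len += chunk;
    }
    return copied;
}

/* Same workload, but the buffer is grown to at least 125% + 10 bytes
 * whenever it must move, so most appends fit in the existing block. */
static size_t copied_overalloc(size_t appends, size_t chunk) {
    size_t len = 0, alloc = 0, copied = 0;
    for (size_t i = 0; i < appends; i++) {
        size_t newlen = len + chunk;
        if (newlen > alloc) {
            size_t minlen = len + (len >> 2) + 10;
            if (newlen < minlen)
                newlen = minlen;
            copied += len;      /* the block moves; contents are copied */
            alloc = newlen;
        }
        len += chunk;
    }
    return copied;
}
```

For 1000 single-character appends the exact-size variant copies 0+1+...+999 = 499500 bytes, while the overallocating variant copies only a few thousand: linear instead of quadratic total copying.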

> Since the algorithm in this patch
> is not necessarily cumulative with the overallocation by malloc()
> it is very well possible that the change has no effect at all
> on systems with an overallocating malloc().

> For example, assume Perl is appending 100 bytes to a 1000 bytes string.
> The patch under discussion will make sure that sv_grow() will request
> 1260 bytes (1000 + 1000>>2 + 10) instead of just 1100 bytes from realloc().
>
> Assume the original 1000 bytes were allocated in a 1024 bytes slab in
> the allocator, so the 1100 bytes wouldn't fit in anymore, and realloc()
> will now move this to e.g. a 1536 byte slab.  In that case it doesn't
> make any difference that we now asked for 1260 bytes instead of 1100.
>
> Also note that the minimum growth is based on the old buffer size,
> so appending 300 bytes to the 1000 byte string will only request
> 1300 bytes, because 1300 is already larger than the 1260 minimum
> realloc growth.
>
> So until proven otherwise I doubt that this patch has any noticeable
> effect on a "good malloc()".  On a "medium malloc()" I would expect
> it to improve performance somewhat, at a moderate additional memory
> requirement.
>
>> Why was this patch not applied with the appropriate
>> #if defined(_WIN32) or what is used for MSVC and mingw?


>
> Because then it wouldn't be applied to other platforms, like FreeBSD.
>
> Note though that I added this remark for discussion about disabling it
> under -Dusemymalloc:
>
> | b) Should the new over-allocation also be used under -Dusemymalloc.  It
> |    will provide less benefit there, but also come at a lower cost because
> |    the newlen can still fall inside the already allocated bucket size
> |    anyways.  I don't have any strong opinion if it should be disabled
> |    in this situation or not.
>
> But as I stated above, I don't think it will make much of a difference either
> way for the -Dusemymalloc case, but would love to see some comprehensive
> benchmarks.

Nicholas had some impressive benchmarks from this change for AIX or
HPUX (i forget), whose realloc has the same problems as Win32's.

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster demerphq 8/20/10 8:05 AM

Ill just note that im not arguing against this patch. Just that
overallocation is not the only reason that a malloc might not be
penalized by this change.

One real-world benchmark that people might want to try would be to use
a routine like this:

sub make_tree {
  my ($depth) = shift;
  return int rand 100 unless $depth>0;
  return [ make_tree($depth-1), make_tree($depth-1) ]
}

and then use the XS implementation of Data::Dumper to dump the results
 of make_tree() for various N.

On win32 even modest N will result in the machine essentially hanging.
On no other OS that I've tried it on is the slowdown as noticeable.
This was traced to the use of realloc in SV_GROW(). This was the
analysis that led to Nicholas' original patch.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/20/10 10:43 AM
On Fri, 20 Aug 2010, demerphq wrote:
> On 20 August 2010 16:53, demerphq <deme...@gmail.com> wrote:
>> On 16 August 2010 08:19, Jan Dubois <ja...@activestate.com> wrote:
>>> Could you provide some evidence for this claim?  The only way a
>>> "better malloc" can prevent this slowdown is by doing some kind of
>>> overallocation itself.
>>
>> This is not correct. Mallocs/reallocs that can merge blocks do not
>> have the performance penalty that this algorithm seeks to work
>> around. The problem here is that the Win32 realloc always copies, so
>> extending a block a character at a time becomes quadratic. A realloc
>> that merges blocks, and only copies when there are not enough
>> contiguous free blocks, does not have this problem.
>
> Ill just note that im not arguing against this patch. Just that
> overallocation is not the only reason that a malloc might not be
> penalized by this change.

Indeed, "the only way" was a rather poor choice of words.

Note though that all the block merge algorithms still depend heavily
on the allocation pattern.  If there are no free blocks to merge, then
they still provide poor performance, and that case *will* be helped
by the overallocation during the growth phase.  This has been shown
for both FreeBSD and the Linux allocators.

> One real-world benchmark that people might want to try would be to use
> a routine like this:
>
> sub make_tree {
>   my ($depth) = shift;
>   return int rand 100 unless $depth>0;
>   return [ make_tree($depth-1), make_tree($depth-1) ]
> }
>
> and then use the XS implementation of Data::Dumper to dump the results
> of make_tree() for various N.
>
> On win32 even modest N will result in the machine essentially hanging.
> On no other OS that I've tried it on is the slowdown as noticeable.
> This was traced to the use of realloc in SV_GROW(). This was the
> analysis that led to Nicholas' original patch.

I don't see how sv_grow() comes into this example at all.  Are you talking
about av_expand()?

That is indeed an area that should probably be scrutinized for potential
Windows performance improvements too.  I also have some vague memory that
hash expansion beyond something like 100,000 keys gets extremely slow on
Windows.

Cheers,
-Jan

Re: Patch to make string-append on win32 100 times faster demerphq 8/20/10 11:59 AM
On 20 August 2010 19:43, Jan Dubois <ja...@activestate.com> wrote:
> On Fri, 20 Aug 2010, demerphq wrote:
>> On 20 August 2010 16:53, demerphq <deme...@gmail.com> wrote:
>>> On 16 August 2010 08:19, Jan Dubois <ja...@activestate.com> wrote:
>>>> Could you provide some evidence for this claim?  The only way a
>>>> "better malloc" can prevent this slowdown is by doing some kind of
>>>> overallocation itself.
>>>
>>> This is not correct. Mallocs/reallocs that can merge blocks do not
>>> have the performance penalty that this algorithm seeks to work
>>> around. The problem here is that the Win32 realloc always copies,
>>> so extending a block a character at a time becomes quadratic. A
>>> realloc that merges blocks, and only copies when there are not
>>> enough contiguous free blocks, does not have this problem.
>>
>> Ill just note that im not arguing against this patch. Just that
>> overallocation is not the only reason that a malloc might not be
>> penalized by this change.
>
> Indeed, "the only way" was a rather poor choice of words.
>
> Note though that all the block merge algorithms still depend heavily
> on the allocation pattern.  If there are no free blocks to merge, then
> they still provide poor performance, and that case *will* be helped
> by the overallocation during the growth phase.  This has been shown
> for both FreeBSD and the Linux allocators.

IMO this is "just" common sense. If we call realloc/malloc less often
then the underlying code will be faster.  Given that the perl
tradition is to trade memory for speed in most situations this patch
makes sense -- and it /also/ provides a vector for us to provide a

  use less 'memory';

which would control how much we preallocate.

>> One real-world benchmark that people might want to try would be to use
>> a routine like this:
>>
>> sub make_tree {
>>   my ($depth) = shift;
>>   return int rand 100 unless $depth>0;
>>   return [ make_tree($depth-1), make_tree($depth-1) ]
>> }
>>
>> and then use the XS implementation of Data::Dumper to dump the results
>> of make_tree() for various N.
>>
>> On win32 even modest N will result in the machine essentially hanging.
>> On no other OS that I've tried it on is the slowdown as noticeable.
>> This was traced to the use of realloc in SV_GROW(). This was the
>> analysis that led to Nicholas' original patch.
>
> I don't see how sv_grow() comes into this example at all.  Are you talking
> about av_expand()?

Actually no, im talking about sv_grow(). Dont forget I said to pass
the output of make_tree() into the XS version of Data::Dumper, which
will stringify the resulting binary tree by extending a buffer
essentially character by character (actually token by token, but this
example is pretty close to char by char) to add open brackets,
integers, and close brackets. It uses sv_grow to keep growing the
buffer as it does so. Interestingly, on Windows you can often make
Data::Dumper go faster by NOT using the XS version, forcing it to use
the pure perl version via $Data::Dumper::Useqq or more recent
equivalents. Im not sure exactly why this makes a difference (I never
dug), but Nicholas was able to demonstrate a huge speedup on some
platforms with his original hack along these lines. Basically it
turned out that DD was calling realloc() pretty much once per
character emitted while dumping. By preallocating an extra 10% the
number of realloc calls was reduced dramatically.
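A rough model of that effect (hypothetical numbers, not a measurement of Data::Dumper itself; the function name is invented) counts how many realloc calls a per-character append pattern triggers with and without a 10% headroom:

```c
#include <stddef.h>

/* Model only: count realloc calls while appending `n` single
 * characters, growing the allocation either to the exact new length
 * or with an extra 10% of headroom each time the block must move. */
static size_t realloc_calls(size_t n, int headroom) {
    size_t len = 0, alloc = 0, calls = 0;
    for (size_t i = 0; i < n; i++) {
        if (len + 1 > alloc) {
            calls++;
            alloc = len + 1;
            if (headroom)
                alloc += alloc / 10;   /* preallocate an extra 10% */
        }
        len++;
    }
    return calls;
}
```

With exact-size growth every append is a realloc call (100000 appends, 100000 calls); with 10% headroom the allocation grows geometrically, so the call count drops to roughly log-of-n (on the order of a hundred calls for 100000 appends).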

Also, note that av_expand() is never called in my example as all
arrays are preallocated with 8 elements (er, i think - certainly more
than 2) and my code uses only 2 per array. The point here is that the
code will produce *many* arrays, and will require DD to do a lot of
concatenation to do the dump.

> That is indeed an area that should probably be scrutinized for potential
> Windows performance improvements too.  I also have some vague memory that
> hash expansion beyond something like 100,000 keys gets extremely slow on
> Windows.

Sounds interesting, but its not the case I meant, i was strictly
referring to the way the XS version of Data::Dumper builds the
stringified version of the object being dumped.

Cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster demerphq 8/20/10 12:13 PM
On 14 August 2010 02:39, Jan Dubois <ja...@activestate.com> wrote:
> On Fri, 30 Jul 2010, Wolfram Humann wrote:
>
> [...]
>
>> I would like to propose that perl requests an additional amount of
>> memory from realloc when strings grow by a small amount. After
>> discussion on c.l.p.m. and with the help of ideas from Ilya Zakharevich
>> and Peter J. Holzer my proposal reads like this:
>> (I'm not a regular user of diff and using it wrong might do more harm
>> than good)
>>
>> In sv.c the following section in Perl_sv_grow():
>>
>>      if (newlen > SvLEN(sv)) {        /* need more room? */
>> #ifndef Perl_safesysmalloc_size
>>      newlen = PERL_STRLEN_ROUNDUP(newlen);
>> #endif
>>
>> should be expanded to:
>>
>> #ifndef PERL_STRLEN_EXPAND_SHIFT
>> #  define PERL_STRLEN_EXPAND_SHIFT 2
>> #endif
>>
>>      if (newlen > SvLEN(sv)) {        /* need more room? */
>>      size_t minlen  = SvCUR(sv);
>>      minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
>>          if (newlen < minlen) newlen = minlen;
>> /*        newlen += (newlen >> 1) + 10; */
>> #ifndef Perl_safesysmalloc_size
>>      newlen = PERL_STRLEN_ROUNDUP(newlen);
>> #endif
>
> Thank you for your patch!  I've now committed a slightly cleaned up version
> to the Perl repository, and added your email address to the AUTHORS file:
>
> http://perl5.git.perl.org/perl.git/commitdiff/f12005599f648e675af22dfef1047191e260bc48
> http://perl5.git.perl.org/perl.git/commitdiff/1fd8fa480e535be8885c2c1b2e7e95c3aabbdf7f
>
> Beyond formatting changes I did:
>
> * change size_t to STRLEN for internal consistency
> * removed the commented out code
> * moved the PERL_STRLEN_EXPAND_SHIFT definition to perl.h

>
> The discussion of this change seemed to have stalled, but I see
> +1 votes from David Golden and Marvin Humphrey, with additional
> information from Ben Morrow that the patch also helps on FreeBSD
> (when building without -Dusemymalloc), with nobody voicing
> any disagreement about applying the patch.
>
> I've kept the original constants (PERL_STRLEN_EXPAND_SHIFT being
> 2, plus the minimum addition of 10).  While the shift overallocates
> by 25%, this is based on the original length of the buffer, not
> the requested new length. The reported 12.5% overallocation used
> by Python (for arrays, not strings) is based on the new size already,
> not the old.
>
> The advantage of this algorithm is also that we don't overallocate
> at all if the buffer already grows by more than 25%, so we won't
> "waste" any space if two long strings are concatenated, only when
> small strings are repeatedly appended to an already long one.
>
> I don't care much either way about the constant 10, but the price
> for it is small: on a 32-bit system a null-string SV already
> occupies 28 bytes (24 + 1*4), so once it grows to 32 bytes,
> the "10" will actually expand it to 36 (24 + 3*4) bytes instead,
> which doesn't matter either way.
>
> Things that could benefit from further discussion are:
>
> a) the algorithm in sv_grow() for expanding SVs with SvOOK
>
>        if (newlen > SvLEN(sv))
>            newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
>   This doesn't make much sense, and is catastrophic when
>   you append a very long string to a short one.  I have not looked
>   into the heritage of this algorithm yet.

I suspect this was Nicholas's hack to reduce the number of times
realloc is called when appending single characters to a buffer over
and over.

One thing I have a bit of hesitancy about is that the benchmarks here
/look/ like they involve concatenating large strings together
(emphasis on "look": i havent examined them very closely,
unfortunately). These types of benchmarks will "blow" the merge-block
algorithm used in some reallocs (depending on memory pressure etc. on
the system at the time). The original thought was to address the
performance of code like this:

my $str="";
$str.="x" for 1...100000;

If we dont overallocate AND we are using a malloc/realloc that doesnt
do block merges (like Windows did when I last used it), you end up
with quadratic performance, as each concatenation ends up being a
realloc call, which in turn ends up doing a memcpy(). Which means one
ends up doing a ridiculously large amount of data copying.

If i recall correctly Nicholas made a comment along the lines of "best
performance gain for least effort" of any patch he had ever written.

Cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

RE: Patch to make string-append on win32 100 times faster Jan Dubois 8/20/10 12:43 PM
On Fri, 20 Aug 2010, demerphq wrote:
> On 14 August 2010 02:39, Jan Dubois <ja...@activestate.com> wrote:
> > Things that could benefit from further discussion are:
> >
> > a) the algorithm in sv_grow() for expanding SVs with SvOOK
> >
> >        if (newlen > SvLEN(sv))
> >            newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
> >
> >   This doesn't make much sense, and is catastrophic when
> >   you append a very long string to a short one.  I have not looked
> >   into the heritage of this algorithm yet.
>
> I suspect this was Nicholas's hack to reduce the number of times
> realloc is called when appending single characters to a buffer over
> and over.

This code is only executed when the SvOOK bit is set (which means that
some bytes at the beginning of the allocated block have been chopped
off).  Are you sure that applies to the Data::Dumper::XS case? (Yes,
I'm too lazy^H^H^H^Hbusy to hunt through Dumper.xs right now.)

> One thing I have a bit of hesitancy about is that the benchmarks here
> /look/ like they involve concatenating large strings together
> (emphasis on "look": i havent examined them very closely,
> unfortunately). These types of benchmarks will "blow" the merge-block
> algorithm used in some reallocs (depending on memory pressure etc. on
> the system at the time). The original thought was to address the
> performance of code like this:
>
> my $str="";
> $str.="x" for 1...100000;

No, the benchmarks in this thread are all about concatenating small strings
to large buffers.

Just to reiterate it once more: the "new" patch will not kick in *at all*
if the required new size is more than (25% + 10 bytes) larger than the
old size.  It only helps repeated appends of smallish increments.
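The rule just restated can be sketched as a small function (a simplification of the code in Perl_sv_grow in sv.c; the function name here is invented):

```c
#include <stddef.h>

/* Sketch of the overallocation rule: grow to at least the current
 * length plus 25% plus 10 bytes, but only when the caller's request
 * is smaller than that minimum; larger requests pass through as-is. */
static size_t grown_len(size_t cur, size_t newlen) {
    size_t minlen = cur + (cur >> 2) + 10;   /* cur + 25% + 10 */
    return newlen < minlen ? minlen : newlen;
}
```

Using Jan's earlier numbers: appending 100 bytes to a 1000-byte string requests 1260 bytes (1000 + 250 + 10) rather than 1100, while appending 300 bytes requests exactly 1300, since that already exceeds the 1260 minimum.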

> If we dont overallocate AND we are using a malloc/realloc that doesnt
> do block merges (like Windows did when I last used it), you end up
> with quadratic performance, as each concatenation ends up being a
> realloc call, which in turn ends up doing a memcpy(). Which means one
> ends up doing a ridiculously large amount of data copying.

Yes, that is exactly the problem this change is supposed to "solve".


> If i recall correctly Nicholas made a comment along the lines of "best
> performance gain for least effort" of any patch he had ever written.

> Could you hunt down the original discussion (and commit id) of Nicholas'
patch?  I don't quite understand how we can still get such huge improvements
if this is supposedly already fixed.

The one patch from Nicholas that I know about is that under -Dusemymalloc
he queries the size of the new block and assigns that number to SvLEN
instead of the requested size.  That prevents additional calls to sv_grow()
when the block wouldn't have to move because realloc() already overallocated.

But that patch does nothing for other allocators.

Cheers,
-Jan

Re: Patch to make string-append on win32 100 times faster demerphq 8/21/10 2:18 AM

Ben, Wolfram, any chance you can try benchmarking this with and
without the new patch?

cheers,
Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 8/21/10 3:52 AM
-------- Original Message --------
Subject: Re: Patch to make string-append on win32 100 times faster
For a patched win32 perl I'll have to wait till Monday. What I can
compare now is Strawberry (usemymalloc=n) to Cygwin (usemymalloc=y).
This slightly golfed version

perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int
rand 100}; say length Dumper mt 18"
34576966

needs approx. 6 seconds to print and 8 seconds to finish on both of
these. No noticeable difference. With a touch of horror I have to report
that the printed string lengths are slightly non-deterministic from run
to run...???

Wolfram

Re: Patch to make string-append on win32 100 times faster demerphq 8/21/10 4:46 AM
On 21 August 2010 12:52, Wolfram Humann <w.c.h...@arcor.de> wrote:
> For a patched win32 perl I'll have to wait till Monday. What I can compare
> now is Strawberry (usemymalloc=n) to Cygwin (usemymalloc=y). This slightly
> golfed version

heh. :-)

> perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int rand
> 100}; say length Dumper mt 18"
> 34576966
>
> needs approx. 6 seconds to print and 8 seconds to finish on both of these.

Interesting, how does that compare for you if you stick a

  $Data::Dumper::Useqq=1;

before the dump?

> No noticeable difference. With a touch of horror I have to report that the
> printed stringlengths are slightly non-deterministic from run to run...???

:-)

No worries there, the code does use rand() after all, so we expect to
see slightly different results from run to run (or the srand
initialization would be broken). Add a srand(1) or change the int rand
100 to a constant and it should produce repeatable results.

Anyway, thanks, i recall code like this taking unacceptably long (iow,
killed before it stopped) on Win2k with the XS version, and it taking
just-barely-acceptably long with the PP version (enabled by the Useqq
option, as the XS code doesnt know how to do double-quoting).

Cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster demerphq 8/21/10 5:18 AM
On 21 August 2010 14:13, demerphq <deme...@gmail.com> wrote:
> Im pretty sure its this one:
>
> $ git log -1  1936d2a74fbc35dd0d686
> commit 1936d2a74fbc35dd0d6862386c9988d92bb8e6e0
> Author: Nicholas Clark <ni...@ccl4.org>
> Date:   Wed Jun 1 20:46:02 2005 +0000
>
>    Round up all string length requests to malloc()/realloc() to the next
>    multiple of 4/8 bytes [sizeof(size_t)] on the assumption that malloc()
>    internally will quantise, and so we're going to use space that
>    otherwise would be wasted. Hopefully this will save realloc()ing.
>
>    p4raw-id: //depot/perl@24665
>
> But i cant find the mail about it. I do recall some of the discussion
> was on IRC, but I also remember a mail. Unfortunately (for some not
> very unfortunate definition of unfortunate :-) Nicholas improves
> performance all the time, so the keywords I keep looking for in my
> mailbox give a rather large number of hits and ive yet to find the
> right one. Perhaps he remembers?

Right after this I tried a different search criterion, and while it
didnt find the mail I vaguely remember, it does seem somewhat relevant:


Avoiding realloc

 Nicholas Clark had a look at way perl allocates memory for scalars. Up
 until now the exact size has always been requested, and the allocated
 size is recorded. This helps avoid unnecessary trips to the heap when
 a string shrinks and then grows yet stays within the size of the
 original allocation.

 On the assumption that "malloc" implementations silently round up
 anyway, because it simplifies the internal bookkeeping, it makes sense
 to round up the requested size silently on perl's side anyway, because
 then scalars that creep a little bit past the initially requested
 size might still remain within the rounded-up allocation, thereby
 avoiding a costly "realloc".
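The rounding described here might be sketched like this (the real macro is PERL_STRLEN_ROUNDUP in perl.h and is considerably more involved; this is just the basic idea, with an invented name):

```c
#include <stddef.h>

/* Round a requested string length up to the next multiple of
 * sizeof(size_t), on the assumption that malloc() quantises to at
 * least that granularity internally anyway. */
static size_t strlen_roundup(size_t n) {
    const size_t q = sizeof(size_t);
    return (n + q - 1) / q * q;
}
```

On a 64-bit build this rounds 1 up to 8 and 9 up to 16; requests already on a multiple of sizeof(size_t) are returned unchanged.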

 Nicholas then set about trying to determine the exact number of bytes
 that are in fact allocated for all small allocation values (where
 small means between 1 and 256). It turns out that a brute force "round
 up to nearest multiple of 4 on a 32-bit CPU" rule wasn't particularly
 effective. "malloc" sometimes allocates much more, especially on
 FreeBSD. Nicholas wanted to know whether it would be worth the effort
 to probe for these sizes at "Configure" time, build a lookup table,
 and tweak the already monstrous "PERL_STRLEN_ROUNDUP" macro to use it
 for short allocation sizes.

 Nicholas simply wanted to minimise the number of calls made to
 "sv_grow", without requiring any extra code or tests elsewhere in the
 interpreter. Gisle was unconvinced, reasoning that it takes memory
 away from "malloc" that "malloc" may be able to deploy more
 effectively elsewhere at some further point in time.

   Use a lookup table to save memory
   http://xrl.us/jqvw

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster demerphq 8/21/10 5:13 AM
On 20 August 2010 21:43, Jan Dubois <ja...@activestate.com> wrote:

Im pretty sure its this one:

$ git log -1  1936d2a74fbc35dd0d686
commit 1936d2a74fbc35dd0d6862386c9988d92bb8e6e0
Author: Nicholas Clark <ni...@ccl4.org>
Date:   Wed Jun 1 20:46:02 2005 +0000

    Round up all string length requests to malloc()/realloc() to the next
    multiple of 4/8 bytes [sizeof(size_t)] on the assumption that malloc()
    internally will quantise, and so we're going to use space that
    otherwise would be wasted. Hopefully this will save realloc()ing.

    p4raw-id: //depot/perl@24665

But i cant find the mail about it. I do recall some of the discussion
was on IRC, but I also remember a mail. Unfortunately (for some not
very unfortunate definition of unfortunate :-) Nicholas improves
performance all the time, so the keywords I keep looking for in my
mailbox give a rather large number of hits and ive yet to find the
right one. Perhaps he remembers?

cheers
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Patch to make string-append on win32 100 times faster Wolfram Humann 8/24/10 10:19 AM
On 21.08.2010 13:46, demerphq wrote:
> On 21 August 2010 12:52, Wolfram Humann<w.c.h...@arcor.de>  wrote:
>    
>
>> perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int rand
>> 100}; say length Dumper mt 18"
>> 34576966
>>
>> needs approx. 6 seconds to print and 8 seconds to finish on both of these.
>>      
> Interesting, how does that compare for you if you stick a
>
>    $Data::Dumper::Useqq=1;
>
> before the dump?

Some more results, wristwatch timed -> ignore minor differences:

N=20
strawberry:    27 seconds
with my patch: 29 seconds
usemymalloc:   10 seconds

With $Data::Dumper::Useqq=1 I had to reduce N to 19 to avoid
out-of-memory. Strawberryperl unpatched only.

Without $Data::Dumper::Useqq=1: 10 seconds,  462 MB
With $Data::Dumper::Useqq=1: 28 seconds, 1100 MB

>> No noticeable difference. With a touch of horror I have to report that the
>> printed stringlengths are slightly non-deterministic from run to run...???
>>      
> :-)
>
> No worries there, the code does use rand() after all, so we expect to
> see slightly different results from run to run (or the srand
> initialization would be broken). Add a srand(1) or change the int rand
> 100 to a constant and it should produce repeatable results.
>

O.k. so no need to be worried about the code. But maybe I need to be
worried about a mind that is obviously able to type "rand" one moment
and complain about the code running non-deterministically about 5
minutes later ;-)

I'll be on vacation soon. Won't be able to respond to news in this
thread quickly (maybe not at all) for the weeks to come.

Wolfram

Re: Patch to make string-append on win32 100 times faster demerphq 8/24/10 10:39 AM
On 24 August 2010 19:19, Wolfram Humann <w.c.h...@arcor.de> wrote:
> On 21.08.2010 13:46, demerphq wrote:
>>
>> On 21 August 2010 12:52, Wolfram Humann<w.c.h...@arcor.de>  wrote:
>>
>>>
>>> perl -MData::Dumper -E"sub mt{my $d=shift;$d>0?[mt($d-1),mt($d-1)]:int
>>> rand
>>> 100}; say length Dumper mt 18"
>>> 34576966
>>>
>>> needs approx. 6 seconds to print and 8 seconds to finish on both of
>>> these.
>>>
>>
>> Interesting, how does that compare for you if you stick a
>>
>>   $Data::Dumper::Useqq=1;
>>
>> before the dump?
>
> Some more results, wristwatch timed -> ignore minor differences:
>
> N=20
> strawberry:    27 seconds
> with my patch: 29 seconds
> usemymalloc:   10 seconds
>
> With $Data::Dumper::Useqq=1 I had to reduce N to 19 to avoid out-of-memory.
> Strawberryperl unpatched only.
>
> Without $Data::Dumper::Useqq=1: 10 seconds,  462 MB
> With $Data::Dumper::Useqq=1: 28 seconds, 1100 MB

Interesting...

>>> No noticeable difference. With a touch of horror I have to report that
>>> the
>>> printed stringlengths are slightly non-deterministic from run to
>>> run...???
>>>
>>
>> :-)
>>
>> No worries there, the code does use rand() after all, so we expect to
>> see slightly different results from run to run (or the srand
>> initialization would be broken). Add a srand(1) or change the int rand
>> 100 to a constant and it should produce repeatable results.
>>
>
> O.k. so no need to be worried about the code. But maybe I need to be worried
> about a mind that is obvously able to type "rand" one moment and complain
> about the code running non-deterministic about 5 minutes later ;-)

I suspect your holiday will resolve any concerns you might have in
this regard....

> I'll be on vacation soon. Won't be able to respond to news in this thread
> quickly (maybe not at all) for the weeks to come.

Have fun... And dont worry about this thread, seems to me the point is proved...

cheers
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
