
Speed of reading some MB of data using qx(...)


Wolfram Humann

Jul 22, 2010, 9:02:32 AM
I have a program that processes PDF files by converting them to
PostScript, reading the PS and doing something with it. I use pdftops
(from xpdf) for the pdf->ps conversion and retrieve the result like this:

$ps_text = qx( pdftops $infile - );

On win32 using Strawberry Perl (tried 5.10 and 5.12) this takes much
more time than I expected, so I did a test: I first converted the PDF
to PostScript, then read the PostScript (about 12 MB) like this (cat
on win32 provided by cygwin):

perl -E" $t = qx(cat psfile.ps); say length $t "

This takes about 16 seconds on win32 but less than 1 second on Linux. I
was afraid that this might be a 'binmode' problem, so I also tried
this:

perl -E" open $in,'cat psfile.ps |'; binmode $in; local $/; $t=<$in>;
say length $t "

But the effect is the same: fast on Linux, slow on win32. Besides
bashing win32 :-), any ideas about the reason and (possibly) a cure?
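For anyone wanting to reproduce the measurement, a minimal timing
harness (just a sketch; Time::HiRes is core, and psfile.ps stands in
for the 12 MB PostScript file):

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $t0 = time;
    my $t  = qx(cat psfile.ps);
    printf "read %d bytes in %.2f s\n", length($t), time() - $t0;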

Wolfram

Uri Guttman

Jul 22, 2010, 12:39:42 PM
>>>>> "WH" == Wolfram Humann <w.c.h...@arcor.de> writes:

WH> perl -E" $t = qx(cat psfile.ps); say length $t "

WH> This takes about 16 seconds on win32 but only <1 seconds on Linux. I
WH> was afraid that this might be a 'binmode' problem so I also tried
WH> this:

WH> perl -E" open $in,'cat psfile.ps |'; binmode $in; local $/; $t=<$in>;
WH> say length $t "

WH> But the effect is the same: fast on linux, slow on win32. Besides
WH> bashing win32 :-) and ideas for reason and (possibly) cure?

you also bashed File::Slurp in a bug report you sent me. obviously it is
a winblows issue. one possibility is that winblows does poor process
context switching and piping between processes causes lots of
that. there may be other reasons but i stay away from redmond.

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Ilya Zakharevich

Jul 22, 2010, 5:14:27 PM
On 2010-07-22, Uri Guttman <u...@StemSystems.com> wrote:
> WH> But the effect is the same: fast on linux, slow on win32. Besides
> WH> bashing win32 :-) and ideas for reason and (possibly) cure?
>
> you also bashed File::Slurp in a bug report you sent me. obviously it is
> a winblows issue. one possibility is that winblows does poor process
> context switching and piping between processes causes lots of
> that. there may be other reasons but i stay away from redmond.

Disable your network adapters, and switch off virus checking. Recheck
speed.

Hope this helps,
Ilya

Wolfram Humann

Jul 23, 2010, 4:53:04 AM
On Jul 22, 6:39 pm, "Uri Guttman" <u...@StemSystems.com> wrote:

> >>>>> "WH" == Wolfram Humann <w.c.hum...@arcor.de> writes:
>
>   WH> perl -E" $t = qx(cat psfile.ps); say length $t "
>
>   WH> This takes about 16 seconds on win32 but only <1 seconds on Linux. I
>   WH> was afraid that this might be a 'binmode' problem so I also tried
>   WH> this:
>
>   WH> perl -E" open $in,'cat psfile.ps |'; binmode $in; local $/; $t=<$in>;
>   WH> say length $t "
>
>   WH> But the effect is the same: fast on linux, slow on win32. Besides
>   WH> bashing win32 :-) and ideas for reason and (possibly) cure?
>
> you also bashed File::Slurp in a bug report you sent me. obviously it is
> a winblows issue. one possibility is that winblows does poor process
> context switching and piping between processes causes lots of
> that. there may be other reasons but i stay away from redmond.
>
> uri

Uri,
it was certainly not my intention to bash File::Slurp in my bug report
-- and when I re-read what I wrote I don't think I did. If I did
bash, I am sorry for it (and would be grateful for an explanation from
your side). I'm not a native speaker; maybe things came out differently
from what I intended.
Besides that, I think the two issues (my bug report and my post here)
are unrelated. It looks like File::Slurp needs to do substitutions on
win32. I was not aware of that as I do not seem to need them when I
read line-by-line on win32. Those substitutions naturally suffer from
the use of $& elsewhere in the program. I think it would be o.k. to
mention this briefly in the docs for File::Slurp. This is a special
case and you might disagree that it should be in the docs. In that
case, just reject and close my bug report.

Regards,
Wolfram

Uri Guttman

Jul 23, 2010, 5:00:02 AM
>>>>> "WH" == Wolfram Humann <w.c.h...@arcor.de> writes:

WH> On Jul 22, 6:39 pm, "Uri Guttman" <u...@StemSystems.com> wrote:

>> you also bashed File::Slurp in a bug report you sent me. obviously it is
>> a winblows issue. one possibility is that winblows does poor process
>> context switching and piping between processes causes lots of
>> that. there may be other reasons but i stay away from redmond.

WH> it was certainly not my intention to bash File::Slurp in my bug report
WH> -- and I when I re-read what I wrote I don't think I did. If I did
WH> bash, I am sorry for it (and would be grateful for explanation from
WH> your side). I'm not a native speaker; maybe things came out different
WH> from what I intended.
WH> Besides that, I think the two issues (my bug report and my post here)
WH> are unrelated. It looks like File::Slurp needs to do substitutions on
WH> win32. I was not aware of that as I do not seem to need them when I
WH> read line-by-line on win32. Those substitutions naturally suffer from
WH> the use of $& elsewhere in the program. I think it would be o.k. to
WH> mention this briefly in the docs for File ::Slurp. This is a special
WH> case and you might disagree that it should be in the docs. In that
WH> case, just reject and close my bug report.

i don't think i should worry about someone else using $& in their
code. most modules use s/// with grabbing and will suffer for the same
reason. this is a silly idea to warn about something i have no control
over.

but you showed with your slurp code test that it was slower on winblows
without my module. so telling me to warn about using $& is wrong as it
isn't the reason for the slowdown. winblows doesn't do forks/pipe well
since it doesn't like process context switching. they push threads which
suck for other reasons. on most unix flavors process and thread context
switching are about equally fast.

Wolfram Humann

Jul 23, 2010, 5:05:14 AM

Ilya,
I tried that and it makes no difference. Also, when enabled, I don't
see any CPU usage in the virus check process -- it all goes to
perl.exe.
I also tried things using Cygwin perl on the same win32 PC: Cygwin
perl also runs in less than 1 second ...

Thanks for the suggestion,
Wolfram

Ben Morrow

Jul 25, 2010, 6:10:53 PM

Quoth Wolfram Humann <w.c.h...@arcor.de>:

Win32's pipes are *really* slow. Write it to a temporary and then read
the file normally in perl.
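A rough sketch of that approach for the original pdftops pipeline
(File::Temp is core; $infile as in the first post; error handling kept
minimal):

    use File::Temp qw(tempfile);

    # Let pdftops write to a temporary file instead of a pipe, then slurp the file.
    my ($fh, $tmp) = tempfile(SUFFIX => '.ps', UNLINK => 1);
    close $fh;
    system('pdftops', $infile, $tmp) == 0 or die "pdftops failed: $?";

    open my $in, '<', $tmp or die "can't open $tmp: $!";
    binmode $in;
    my $ps_text = do { local $/; <$in> };
    close $in;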

Ben

Wolfram Humann

Jul 26, 2010, 3:24:07 AM
On Jul 26, 12:10 am, Ben Morrow <b...@morrow.me.uk> wrote:
>
> Win32's pipes are *really* slow. Write it to a temporary and then read
> the file normally in perl.

After further experiments I am now convinced that pipes are not the
bottleneck (even on win32, piping 10 MB can't take 20 seconds...). The
problem seems to be Strawberry's memory management for long strings. I
need to do some more benchmarking and will report this issue
separately.

Does anybody know of the appropriate place to report Strawberry Perl
specific bugs?

Wolfram

Ben Morrow

Jul 26, 2010, 5:39:23 AM

Quoth Wolfram Humann <w.c.h...@arcor.de>:

I seriously doubt the issue is with Strawberry specifically; almost
certainly any issue applies to Win32 perl in general and should be reported
to p5p with perlbug. If you can confirm it is specific to Strawberry
(so, e.g., a self-compiled mingw perl *doesn't* have the problem) then I
think the correct place is the Perl::Dist::Strawberry queue on
rt.cpan.org (mail bug-Perl-Dis...@rt.cpan.org).

(IIRC malloc on Win32 is *also* known to be deadly slow, and also IIRC
it's impossible to use perl's malloc without breaking things...)

Ben

Wolfram Humann

Jul 26, 2010, 6:07:09 AM
On Jul 26, 11:39 am, Ben Morrow <b...@morrow.me.uk> wrote:
>
> I seriously doubt the issue is with Strawberry specifically; almost
> certainly any issue applies Win32 perl in general and should be reported
> to p5p with perlbug. If you can confirm it is specific to Strawberry
> (so, e.g., a self-compiled mingw perl *doesn't* have the problem) then I
> think the correct place is the Perl::Dist::Strawberry queue on
> rt.cpan.org (mail bug-Perl-Dist-Strawbe...@rt.cpan.org).

>
> (IIRC malloc on Win32 is *also* known to be deadly slow, and also IIRC
> it's impossible to use perl's malloc without breaking things...)
>
> Ben

Thanks for the pointers. My comparison is Strawberry Perl (5.10 and
5.12) against Cygwin Perl on the same machine. The latter (as well as
Perl on Linux) doesn't have the issues I see. Is that a sufficient
"proof" for the issues being Strawberry specific?

Wolfram

Peter J. Holzer

Jul 26, 2010, 7:24:52 AM
On 2010-07-26 10:07, Wolfram Humann <w.c.h...@arcor.de> wrote:
> On Jul 26, 11:39 am, Ben Morrow <b...@morrow.me.uk> wrote:
>> I seriously doubt the issue is with Strawberry specifically; almost
>> certainly any issue applies Win32 perl in general and should be reported
>> to p5p with perlbug. If you can confirm it is specific to Strawberry
>> (so, e.g., a self-compiled mingw perl *doesn't* have the problem) then I
>> think the correct place is the Perl::Dist::Strawberry queue on
>> rt.cpan.org (mail bug-Perl-Dist-Strawbe...@rt.cpan.org).
>>
>> (IIRC malloc on Win32 is *also* known to be deadly slow, and also IIRC
>> it's impossible to use perl's malloc without breaking things...)
>
> Thanks for the pointers. My comparison is Strawberry Perl (5.10 and
> 5.12) against Cygwin Perl on the same machine. The latter (as well as
> Perl on Linux) doesn't have the issues I see. Is that a sufficient
> "proof" for the issues being Strawberry specific?

I remember vaguely that ActiveState Perl has similar issues. I think Ben
is correct that this is a problem with Win32 malloc and will affect any
Perl build which uses the Win32 malloc implementation (Cygwin probably
provides its own malloc implementation). A fix which works on both
ActiveState and Strawberry would certainly be preferable to a
Strawberry-specific one.

hp

Ben Morrow

Jul 26, 2010, 9:11:43 AM

Quoth Wolfram Humann <w.c.h...@arcor.de>:

> On Jul 26, 11:39 am, Ben Morrow <b...@morrow.me.uk> wrote:
> >
> > I seriously doubt the issue is with Strawberry specifically; almost
> > certainly any issue applies Win32 perl in general and should be reported
> > to p5p with perlbug. If you can confirm it is specific to Strawberry
> > (so, e.g., a self-compiled mingw perl *doesn't* have the problem) then I
> > think the correct place is the Perl::Dist::Strawberry queue on
> > rt.cpan.org (mail bug-Perl-Dist-Strawbe...@rt.cpan.org).
> >
> > (IIRC malloc on Win32 is *also* known to be deadly slow, and also IIRC
> > it's impossible to use perl's malloc without breaking things...)
>
> Thanks for the pointers. My comparison is Strawberry Perl (5.10 and
> 5.12) against Cygwin Perl on the same machine. The latter (as well as
> Perl on Linux) doesn't have the issues I see. Is that a sufficient
> "proof" for the issues being Strawberry specific?

No. As far as Perl is concerned, Cygwin is a separate OS. A fair
comparison would be with ActiveState or (as I said) with a Win32 perl
you've compiled yourself.

If the issue simply turns out to be 'Microsoft don't know how to write a
decent malloc', there is very little p5p can do about it, of course. On
most platforms perl can, and often does, use its own malloc
implementation which is optimised for perl's use (lots of tiny
allocations and deallocations all the time). This isn't possible on
Win32 unless you make a custom build of perl that doesn't support the
fork emulation.
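A quick way to check which allocator a given perl binary was built with
is the standard -V slice syntax:

    perl -V:usemymalloc

which prints usemymalloc='y' for a build using perl's own malloc and
usemymalloc='n' otherwise.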

Ben

Peter J. Holzer

Jul 26, 2010, 11:12:19 AM
On 2010-07-26 13:11, Ben Morrow <b...@morrow.me.uk> wrote:
> If the issue simply turns out to be 'Microsoft don't know how to write a
> decent malloc', there is very little p5p can do about it, of course. On
> most platforms perl can, and often does, use its own malloc
> implementation which is optimised for perl's use (lots of tiny
> allocations and deallocations all the time). This isn't possible on
> Win32 unless you make a custom build of perl that doesn't support the
> fork emulation.

Since the fork emulation works with Win32 malloc, I think it should be
possible to write a custom malloc based on Win32 malloc (or the
underlying API calls) which still works with the fork emulation but is
faster. But it's probably not easy or somebody would have done it
already (I don't pretend to understand either memory allocation or the
fork emulation on windows).

hp

Wolfram Humann

Jul 26, 2010, 12:48:52 PM
On Jul 26, 3:11 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> Quoth Wolfram Humann <w.c.hum...@arcor.de>:

> > Thanks for the pointers. My comparison is Strawberry Perl (5.10 and
> > 5.12) against Cygwin Perl on the same machine. The latter (as well as
> > Perl on Linux) doesn't have the issues I see. Is that a sufficient
> > "proof" for the issues being Strawberry specific?
>
> No. As far as Perl is concerned, Cygwin is a separate OS. A fair
> comparison would be with ActiveState or (as I said) with a Win32 perl
> you've compiled yourself.

Oh dear, you're right: ActiveState Perl is just as bad as Strawberry.
Here's my test case (I always append a number of chunks with a total
size of 1E6 chars to an existing string, but the start size of the
existing string and the chunk size vary):


use strict;
use warnings;
use Time::HiRes qw(time);

my $c1E1 = '#' x 1E1;
my $c1E2 = '#' x 1E2;
my $c1E3 = '#' x 1E3;
my $c1E4 = '#' x 1E4;
my $c1E5 = '#' x 1E5;


my $str1 = '#' x 1E5;
my $str2 = '#' x 1E6;
my $str3 = '#' x 1E7;

my $str4 = '#' x 1E7;
my $str5 = '#' x 1E7;
my $str6 = '#' x 1E7;
my $str7 = '#' x 1E7;
my $str8 = '#' x 1E7;

my $str9 = '#' x 2E7;
$str9 = '#' x 1E7;

my @ar1 = map{ $c1E2 } 1..1E5;

my @c = (
    '1E5 chars + 1E4 x 1E2 chars' => sub { $str1 .= $c1E2 for 1..1E4 },
    '1E6 chars + 1E4 x 1E2 chars' => sub { $str2 .= $c1E2 for 1..1E4 },
    '1E7 chars + 1E4 x 1E2 chars' => sub { $str3 .= $c1E2 for 1..1E4 },
    '',
    '1E7 chars + 1E5 x 1E1 chars' => sub { $str4 .= $c1E1 for 1..1E5 },
    '1E7 chars + 1E4 x 1E2 chars' => sub { $str5 .= $c1E2 for 1..1E4 },
    '1E7 chars + 1E3 x 1E3 chars' => sub { $str6 .= $c1E3 for 1..1E3 },
    '1E7 chars + 1E2 x 1E4 chars' => sub { $str7 .= $c1E4 for 1..1E2 },
    '1E7 chars + 1E1 x 1E5 chars' => sub { $str8 .= $c1E5 for 1..1E1 },
    '',
    '1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars' => sub { $str9 .= $c1E2 for 1..1E4 },
    '1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars ' => sub { push @ar1, $c1E2 for 1..1E4 },
);

while (@c) {
    my $name = shift @c;
    print("\n"), next unless $name;
    my $code = shift @c;
    my $t1 = time; &$code; my $t2 = time;
    printf "%s: %6.1f ms\n", $name, 1000 * ($t2 - $t1);
}

##########################################################
And these are the results:

c:\cygwin\bin\perl LongStrings.pl
1E5 chars + 1E4 x 1E2 chars: 1.6 ms
1E6 chars + 1E4 x 1E2 chars: 2.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 11.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.5 ms
1E7 chars + 1E3 x 1E3 chars: 0.9 ms
1E7 chars + 1E2 x 1E4 chars: 1.0 ms
1E7 chars + 1E1 x 1E5 chars: 0.9 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.5 ms

##########################################################
c:\strawberry\perl\bin\perl LongStrings.pl
1E5 chars + 1E4 x 1E2 chars: 94.4 ms
1E6 chars + 1E4 x 1E2 chars: 319.9 ms
1E7 chars + 1E4 x 1E2 chars: 2710.4 ms

1E7 chars + 1E5 x 1E1 chars: 2656.0 ms
1E7 chars + 1E4 x 1E2 chars: 2656.1 ms
1E7 chars + 1E3 x 1E3 chars: 2609.1 ms
1E7 chars + 1E2 x 1E4 chars: 1109.1 ms
1E7 chars + 1E1 x 1E5 chars: 118.3 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.5 ms

I compared Strawberry and ActiveState on another machine: the times
are close to each other but even longer than the ones above due to the
older hardware.

Wolfram

Ilya Zakharevich

Jul 26, 2010, 5:45:19 PM

"My" malloc() (one shipped with Perl) needs only 1 syscall implemented
on a particular architecture: get some pages from the system. So the
macro should be defined on the command line of $(CC) -c malloc.c.

Hope this helps,
Ilya

Ben Morrow

Jul 26, 2010, 8:15:28 PM

Quoth Ilya Zakharevich <nospam...@ilyaz.org>:

There is an option in the Win32 makefile to do exactly that, with a note
that it doesn't work with USE_IMP_SYS. I believe this has something to
do with needing a separate malloc pool for each thread (pseudo-process),
but I don't know. I would have thought, on the face of it, that it
should be possible to simply replace malloc with mymalloc wherever it
ends up being called, but I don't pretend to understand anything in
perlhost.h.

Ben

Wolfram Humann

Jul 27, 2010, 2:14:28 AM
I do understand how an inefficient malloc would explain the chunk-size
dependency in the code. What I do not understand is the dependency on
the initial size of the string that I append to. Does Perl re-allocate
the complete string (at least from time to time) when I keep appending
to it? Or could there be other effects involved besides malloc speed ?
(Ideally I would profile Strawberry Perl on the C level but I don't
know how to do that...)

Wolfram

Uri Guttman

Jul 27, 2010, 2:20:25 AM
>>>>> "WH" == Wolfram Humann <w.c.h...@arcor.de> writes:

WH> I do understand how an unefficient malloc would explain the chunk-size
WH> dependency in the code. What I do not understand is the dependency on
WH> the initial size of the string that I append to. Does Perl re-allocate
WH> the complete string (at least from time to time) when I keep appending
WH> to it? Or could there be other effects involved besides malloc speed ?
WH> (Ideally I would profile Strawberry Perl on the C level but I don't
WH> know how to do that...)

perl does a realloc (or its equiv) on strings which grow too big. i
believe it does a doubling each time (same as for hash buckets). if so,
a large initial size will affect things as it will mean fewer malloc
calls as it grows. if malloc is the bottleneck, this will show up.
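A minimal sketch of the pre-extend workaround this implies, mirroring
the $str9 case in the benchmark above (sizes and names are illustrative):

    my $chunk = '#' x 1E2;
    my $buf   = '#' x 2E7;        # force a large allocation up front
    $buf = '#' x 1E7;             # shorter contents, but the larger buffer is kept
    $buf .= $chunk for 1 .. 1E4;  # these appends fit without further reallocation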

Wolfram Humann

Jul 27, 2010, 3:18:42 AM
On 27 Jul., 08:20, "Uri Guttman" <u...@StemSystems.com> wrote:

> >>>>> "WH" == Wolfram Humann <w.c.hum...@arcor.de> writes:
>
>   WH> I do understand how an unefficient malloc would explain the chunk-size
>   WH> dependency in the code. What I do not understand is the dependency on
>   WH> the initial size of the string that I append to. Does Perl re-allocate
>   WH> the complete string (at least from time to time) when I keep appending
>   WH> to it? Or could there be other effects involved besides malloc speed ?
>   WH> (Ideally I would profile Strawberry Perl on the C level but I don't
>   WH> know how to do that...)
>
> perl does a realloc (or its equiv) on strings which grow too big. i
> believe it does a doubling each time (same as for hash buckets). if so,
> a large initial size will affect things as it will mean fewer malloc
> calls as it grows. if malloc is the bottleneck, this will show up.

Unfortunately, that makes things just more mysterious (at least to my
understanding):


> 1E5 chars + 1E4 x 1E2 chars: 94.4 ms
> 1E6 chars + 1E4 x 1E2 chars: 319.9 ms
> 1E7 chars + 1E4 x 1E2 chars: 2710.4 ms

While appending 1E6 chars (= 1E4 chunks of 1E2 chars each) to an
initial string size of 1E5 might require 3-4 reallocs, appending 1E6
chars to an initial string size of 1E7 chars involves at most 1
realloc. And (even though not measured in my sample code above) the
from-scratch en-bloc allocation of a 2E7 chars string is reasonably
fast in Strawberry Perl.

Wolfram

Wolfram Humann

Jul 27, 2010, 9:15:15 AM
Alright, I did some profiling and code-reading, and what I found is
something that I would consider a bug, or at least fairly poor coding
practice, in the core.
Opinions very welcome!

In Strawberry almost the entire time is spent in the following call
sequence:

Perl_sv_catpvn_flags -> Perl_sv_grow -> Perl_safesysrealloc -> realloc
(msvcrt)

Perl_sv_catpvn_flags (in sv.c) is documented as "Concatenates the
string onto the end of the string which is in the SV". That's what my
code does all the time. So far so good.
Perl_sv_catpvn_flags *always* calls SvGrow -> Perl_sv_grow (also in
sv.c). Perl_sv_grow then needs to decide if the string's memory is
already sufficient or really needs to grow. In the latter case,
safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
interesting point is: how much memory does it request? The answer is:

newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */

I.e. it requests 10 times as much memory as is required for the
current append operation. So when I loop 10000 times and each time
append 100 chars to an initial string size of 10 million, the memory
grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
it ends at 11.000e6. I can sort of confirm this to be true if I look
at the memory graph in Process Explorer: it grows smoothly (no
discernible steps), becoming incrementally slower towards the end
(because the amount of memory that needs to be copied for each realloc
increases).
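A back-of-the-envelope model of that growth rule (plain Perl
arithmetic, not perl internals) shows where the time goes:

    # Grow a 10e6-char string to 11e6 chars in 1e4 appends of 100 chars each,
    # extending the buffer by 10 * (appended length) whenever it is too small.
    my ($len, $alloc, $reallocs, $copied) = (10_000_000, 10_000_000, 0, 0);
    for (1 .. 10_000) {
        $len += 100;
        if ($len > $alloc) {
            $alloc = $len + 10 * 100;   # newlen += 10 * (newlen - SvCUR(sv))
            $reallocs++;
            $copied += $len;            # worst case: realloc copies the whole string
        }
    }
    printf "%d reallocs, about %.1f GB copied\n", $reallocs, $copied / 1e9;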

Growing memory in such tiny increments is what I consider bad
practice.

By the way: I estimate the time required for each realloc to be around
3 ms for 10e6 chars, growing linearly with the amount of data -- I
consider that a fair speed and no reason to blame win32.

What happens in Cygwin? The stack-sampling profiler is of little help
because it easily misses infrequent events. I would expect that
Perl_sv_grow is called just as often as in Strawberry Perl. The
difference is that safesysrealloc does not call Perl_safesysrealloc ->
realloc, it calls Perl_realloc. And Perl_realloc (in malloc.c) seems
to have its own logic (something with '<<' and 'LOG' and 'pow' which
I did not try to fully understand) to determine what amount of memory
it finally allocates. When I add some sleep() to the string append
process, I can see how memory grows in Process Explorer: There are 5
steps (probably corresponding to 5 calls to Perl_realloc) of growing
size when I start with 0.1e6 chars and then grow to 1.1e6 chars. When
I start with 10e6 chars and grow to 11e6 chars, there is just 1 step
in memory size. This looks like a sensible memory growth strategy to
me. It explains why Cygwin is several 100 times faster than Strawberry
Perl. It also explains why I observed during my experiments that
Cygwin Perl consistently needs more memory than Strawberry Perl -- but
that's a small price to pay for such a dramatic speedup.

Wolfram


Ben Morrow

Jul 27, 2010, 10:13:37 AM

Quoth Wolfram Humann <w.c.h...@arcor.de>:

Possibly; I don't know what the rationale behind that choice was.
Certainly Perl seems to expect whatever malloc it's using to be smart
about pre-allocating extra memory and using that to satisfy reallocs.

> By the way: I estimate the time required for each realloc to be around
> 3 ms for 10e6 chars, growing linearly with the amount of data -- I
> consider that a fair speed and no reason to blame win32.

If you timed perl's own realloc, you would (I believe) find it does much
better than this. AFAICS from the code, it has a fixed set of block
sizes it actually allocates. Enlarging a block such that it doesn't go
over the block size actually allocated is *free*, not even linear in the
size of the block, since all it does is adjust the end marker. This is
the logic you are expecting sv_grow to implement, but perl has decided
that this is the allocator's responsibility.

> What happens in Cygwin? The stack-sampling profiler is of little help
> because it easily misses infrequent events. I would expect that
> Perl_sv_grow is called just as often as in Strawberry Perl. The
> difference is that safesysrealloc does not call Perl_safesysrealloc ->
> realloc, it calls Perl_realloc.

Right, so your Cygwin perl is built with -Dusemymalloc.

> And Perl_realloc (in malloc.c) seems
> to have it's own logic (something with '<<' and 'LOG' and 'pow' which
> I did not try to fully understand) to determine what amount of memory
> it finally allocates. When I add some sleep() to the string append
> process, I can see how memory grows in Process Explorer: There are 5
> steps (probably corresponding to 5 calls to Perl_realloc

No, I think not. I think Perl_realloc gets called just as often as
realloc got called with Strawberry, it's just that most of the time
realloc can return the new block without having to call sbrk (or
whatever Cygwin uses instead) and without having to do any copying.

> ) of growing
> size when I start with 0.1e6 chars and then grow to 1.1e6 chars. When
> I start with 10e6 chars and grow to 11e6 chars, there is just 1 step
> in memory size. This looks like a sensible memory growth strategy to
> me. It explains why Cygwin is several 100 times faster than Strawberry
> Perl. It also explains why I observed during my experiments that
> Cygwin Perl consistently needs more memory than Strawberry Perl -- but
> that's a small price to pay for such a dramatic speedup.

OK, I just ran your benchmark with the following perls:

5.8.8-vanilla i386-freebsd, default build options
5.8.8-malloc i386-freebsd, -Dusemymalloc

(chosen solely because this is the only matched pair of mymalloc/not
perls I have lying around) and got these results:

~/src/perl% runperl 5.8.8-vanilla realloc
1E5 chars + 1E4 x 1E2 chars: 420.3 ms
1E6 chars + 1E4 x 1E2 chars: 1043.1 ms
1E7 chars + 1E4 x 1E2 chars: 7159.0 ms

1E7 chars + 1E5 x 1E1 chars: 7590.6 ms
1E7 chars + 1E4 x 1E2 chars: 7148.9 ms
1E7 chars + 1E3 x 1E3 chars: 7158.1 ms
1E7 chars + 1E2 x 1E4 chars: 2948.6 ms
1E7 chars + 1E1 x 1E5 chars: 326.1 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 5.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 15.4 ms
~/src/perl% runperl 5.8.8-malloc realloc
1E5 chars + 1E4 x 1E2 chars: 18.6 ms
1E6 chars + 1E4 x 1E2 chars: 18.5 ms
1E7 chars + 1E4 x 1E2 chars: 45.3 ms

1E7 chars + 1E5 x 1E1 chars: 86.1 ms
1E7 chars + 1E4 x 1E2 chars: 7.4 ms
1E7 chars + 1E3 x 1E3 chars: 6.5 ms
1E7 chars + 1E2 x 1E4 chars: 3.1 ms
1E7 chars + 1E1 x 1E5 chars: 3.4 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 39.6 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 8.7 ms
~/src/perl%

So the difference you are seeing is precisely the difference between
using the system malloc and using perl's. (FreeBSD's malloc, unlike
Win32's, has a reputation for being rather efficient, so this lets
Microsoft off the hook.)

Would you be able to repeat these tests with 5.12.0: that is, build
(under Cygwin, if you don't have access to a Unix system) matched perls
configured with and without -Dusemymalloc, and run the test on both?
I'll try and do the same here, but I can't promise I'll have time. If
the slowdown still exists in 5.12, I think you have a good case for a
bug report. I'm not sure how possible it would be to fix, but it would
clearly be a big win under some circumstances to be able to build Win32
perl with perl's malloc.

Ben

Wolfram Humann

Jul 27, 2010, 11:03:35 AM
On Jul 27, 4:13 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> Would you be able to repeat these tests with 5.12.0: that is, build
> (under Cygwin, if you don't have access to a Unix system) matched perls
> configured with and without -Dusemymalloc, and run the test on both?
> I'll try and do the same here, but I can't promise I'll have time. If
> the slowdown still exists in 5.12, I think you have a good case for a
> bug report. I'm not sure how possible it would be to fix, but it would
> clearly be a big win under some circumstances to be able to build Win32
> perl with perl's malloc.

I do have a Linux machine and I did compile my own perl there, so I
think I could redo that (possibly more easily than recompiling Perl for
Cygwin). The strange thing is that on Cygwin perl -V says
'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
Linux my benchmark runs everything under 12 ms. Are you certain
changing usemymalloc would have much effect there?


What I would much more *like* to try is recompile a perl (e.g.
strawberry perl) on win32 and replace

newlen += 10 * (newlen - SvCUR(sv));

with something like

newlen += 10 * (newlen - SvCUR(sv)) + 0.5 * SvCUR(sv);

(with the factor reasonably somewhere between 0.2 and 1)
but a quick attempt to follow http://perldoc.perl.org/perlwin32.html
was not successful :(

Wolfram

Ben Morrow

Jul 27, 2010, 1:07:58 PM

Quoth Wolfram Humann <w.c.h...@arcor.de>:

> On Jul 27, 4:13 pm, Ben Morrow <b...@morrow.me.uk> wrote:
> > Would you be able to repeat these tests with 5.12.0: that is, build
> > (under Cygwin, if you don't have access to a Unix system) matched perls
> > configured with and without -Dusemymalloc, and run the test on both?
> > I'll try and do the same here, but I can't promise I'll have time. If
> > the slowdown still exists in 5.12, I think you have a good case for a
> > bug report. I'm not sure how possible it would be to fix, but it would
> > clearly be a big win under some circumstances to be able to build Win32
> > perl with perl's malloc.
>
> I do have a linux machine and I did comile my own perl there so I
> think a could redo that (possibly easier than recompiling Perl for
> Cygwin). The strange thing is that on Cygwin perl -V says
> 'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
> Linux my bechmark runs everything under 12 ms. Are you certain
> changing usemymalloc would have much effect there?

No. It's possible that glibc's malloc already behaves the way perl is
expecting it to, so using perl's malloc doesn't change the performance
much.

Given that we know Win32's malloc behaves badly, one thing to try would
be building Win32 perls without USE_IMP_SYS, but with and without
PERL_MALLOC. I will try to repeat the FreeBSD tests with 5.12, since
that seems to show the symptoms.

> What I would much more *like* to try is recompile a perl (e.g.
> strawberry perl) on win32 and replace
>
> newlen += 10 * (newlen - SvCUR(sv));
>
> with something like
>
> newlen += 10 * (newlen - SvCUR(sv)) + 0.5 * SvCUR(sv);
>
> (with the factor reasonably somewhere between 0.2 and 1)
> but a quick attempt to follow http://perldoc.perl.org/perlwin32.html
> was not successful :(

Last time I built perl on Win32 I started by installing Strawberry and
putting c:\strawberry\c\bin in %PATH%, and setting INCLUDE and LIB to
c:\strawberry\c\include and \lib respectively. That gives you a
known-good toolchain to start with. It's best to make sure you don't
have anything unnecessary in %PATH%; in particular, you mustn't have
some other copy of perl. Also remember that your build directory must
not have any spaces in its name.

Ben

jl_...@hotmail.com

Jul 27, 2010, 1:57:54 PM
On Jul 22, 7:02 am, Wolfram Humann <w.c.hum...@arcor.de> wrote:
> I have a program that processes PDF files by converting them to
> Postscript, read the ps and do something with it. I use pdftops (from
> xpdf) for the pdf->ps conversion and retrieve the result like this:
>
> $ps_text = qx( pdftops $infile - );
>
> On win32 using strawberry perl (tried 5.10 and 5.12) this takes much
> more time than I expected so I did a test and first converted the PDF
> to Postscript, then read the Postscript (about 12 MB) like this (cat
> on win32 provided by cygwin):
>
> perl -E" $t = qx(cat psfile.ps); say length $t "
>
> This takes about 16 seconds on win32 but only <1 seconds on Linux.


Dear Wolfram,

I've encountered a similar problem on Strawberry Perl before.

I'm curious: Could you try "pre-allocating" the needed space to
$ps_text (or $t) before you set it? For example, try this:

perl -E "$t = ' ' x (-s 'psfile.ps'); $t = qx(cat psfile.ps); say length $t"

See if that helps. I've found that setting my variable to the
target length BEFORE it's set to the proper string can reduce time
significantly (when it is eventually being set to its target value).
I'm not sure why this is so, but I can guess that it's because it can
avoid the time-consuming process of "growing" the string a little at a
time.

I hope this helps,

-- Jean-Luc

Peter J. Holzer

Jul 27, 2010, 5:09:34 PM
On 2010-07-27 17:07, Ben Morrow <b...@morrow.me.uk> wrote:
>
> Quoth Wolfram Humann <w.c.h...@arcor.de>:
>> On Jul 27, 4:13 pm, Ben Morrow <b...@morrow.me.uk> wrote:
>> > Would you be able to repeat these tests with 5.12.0: that is, build
>> > (under Cygwin, if you don't have access to a Unix system) matched perls
>> > configured with and without -Dusemymalloc, and run the test on both?
>> > I'll try and do the same here, but I can't promise I'll have time. If
>> > the slowdown still exists in 5.12, I think you have a good case for a
>> > bug report. I'm not sure how possible it would be to fix, but it would
>> > clearly be a big win under some circumstances to be able to build Win32
>> > perl with perl's malloc.
>>
>> I do have a linux machine and I did comile my own perl there so I
>> think a could redo that (possibly easier than recompiling Perl for
>> Cygwin). The strange thing is that on Cygwin perl -V says
>> 'usemymalloc=y' while the one on Linux says 'usemymalloc=n'. And on
>> Linux my bechmark runs everything under 12 ms. Are you certain
>> changing usemymalloc would have much effect there?
>
> No. It's possible that glibc's malloc already behaves the way perl is
> expecting it to, so using perl's malloc doesn't change the performance
> much.

I'm pretty sure that GNU malloc doesn't round up to powers of two or
something like that. However, the performance difference between GNU
malloc and Perl malloc is rather small:

perl 5.12.1, default config, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 3.9 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 4.4 ms

1E7 chars + 1E5 x 1E1 chars: 28.4 ms
1E7 chars + 1E4 x 1E2 chars: 4.5 ms
1E7 chars + 1E3 x 1E3 chars: 2.6 ms
1E7 chars + 1E2 x 1E4 chars: 2.0 ms
1E7 chars + 1E1 x 1E5 chars: 1.9 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms

perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2:

1E5 chars + 1E4 x 1E2 chars: 2.6 ms
1E6 chars + 1E4 x 1E2 chars: 3.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms

1E7 chars + 1E5 x 1E1 chars: 18.8 ms
1E7 chars + 1E4 x 1E2 chars: 2.5 ms
1E7 chars + 1E3 x 1E3 chars: 0.9 ms
1E7 chars + 1E2 x 1E4 chars: 0.9 ms
1E7 chars + 1E1 x 1E5 chars: 1.1 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms

That may be accidental, though: strace output (for GNU malloc) shows
that very few reallocations actually result in a different address -
mostly the allocated area can be grown because there is nothing after
it. This is even true if two strings grow in parallel - each time one of
the strings moves it leaves a hole which the other string can grow into,
so in practice this works like a binary backoff. I guess there are
allocation patterns which spoil this effect (e.g. if you allocate lots
of small objects while growing large strings) but I haven't tried to
find them.

hp

Ilya Zakharevich

Jul 27, 2010, 8:50:53 PM
On 2010-07-27, Wolfram Humann <w.c.h...@arcor.de> wrote:
> sv.c). Perl_sv_grow then needs to decide if the string's memory is
> already sufficient or really needs to grow. In the latter case,
> safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
> interesting point is: how much memory does it request? The answer is:
>
> newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
> I.e. it requests 10 times as much memory as is required for the
> current append operation. So when I loop 10000 times and each time
> append 100 chars to an initial string size of 10 million, the memory
> grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
> it ends at 11.000e6.

Good l*rd!

The current algorithm is optimized to work in tandem with "my"
malloc(), which would round up to a certain geometric progression
anyway. So if one uses a different malloc(), one had better use

newlen += (newlen >> 4) + 10; /* avoid copy each time */

Thanks for an analysis (and report this on p5p)!

Yours,
Ilya

P.S. And the current algorithm is a disaster if you try to append a
string of length 200e6 to a short string...
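Modelled the same way as the 10-times-the-delta rule earlier in the
thread, this geometric rule needs only a couple of reallocations to
grow 10e6 chars to 11e6 chars in 100-char appends (again just
arithmetic, not perl internals):

    my ($len, $alloc, $reallocs) = (10_000_000, 10_000_000, 0);
    for (1 .. 10_000) {
        $len += 100;
        if ($len > $alloc) {
            $alloc = $len + ($len >> 4) + 10;   # newlen += (newlen >> 4) + 10
            $reallocs++;
        }
    }
    print "$reallocs reallocs\n";   # 2, instead of roughly a thousand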

Ilya Zakharevich

Jul 27, 2010, 8:59:33 PM
On 2010-07-27, Ben Morrow <b...@morrow.me.uk> wrote:
>> By the way: I estimate the time required for each realloc to be around
>> 3 ms for 10e6 chars, growing linearly with the amount of data -- I
>> consider that a fair speed and no reason to blame win32.
>
> If you timed perl's own realloc, you would (I believe) find it does much
> better than this. AFAICS from the code, it has a fixed set of block
> sizes it actually allocates. Enlarging a block such that it doesn't go
> over the block size actually allocated is *free*, not even linear in the
> size of the block, since all it does is adjust the end marker. This is
> the logic you are expecting sv_grow to implement, but perl has decided
> that this is the allocator's responsibility.

Wrong analysis. *In this particular situation* one would find Perl's
realloc() as slow as M$'s one - but it would be called MUCH more
rarely. With my malloc(), I implemented an extra call:
malloced_size(). So on the 1st iteration, things would go similarly
to "their-malloc", and realloc() would be called - and your analysis
would apply.

But on the 2nd iteration of appending, Perl would discover the new
*really* allocated size for the string, and would update the buffer
size stored in the SV. After this, for many iterations the realloc() is
out of the loop altogether.

BTW, IIRC, my malloc() does not store the "end marker" at all (unless
a debugging version is used). (The bucket size of short strings
depends only on which page of memory they belong to - well, actually,
half-page.)

Hope this helps,
Ilya

Ilya Zakharevich

Jul 27, 2010, 9:02:21 PM
On 2010-07-27, Peter J. Holzer <hjp-u...@hjp.at> wrote:
> I'm pretty sure that GNU malloc doesn't round up to powers of two or
> something like that. However, the performance difference between GNU
> malloc and Perl malloc is rather small:

Yeah, right. Not more than 3 times. ;-)

> perl 5.12.1, default config, EGLIBC 2.11.2-2:
>
> 1E5 chars + 1E4 x 1E2 chars: 3.9 ms
> 1E6 chars + 1E4 x 1E2 chars: 3.8 ms
> 1E7 chars + 1E4 x 1E2 chars: 4.4 ms
>
> 1E7 chars + 1E5 x 1E1 chars: 28.4 ms
> 1E7 chars + 1E4 x 1E2 chars: 4.5 ms
> 1E7 chars + 1E3 x 1E3 chars: 2.6 ms
> 1E7 chars + 1E2 x 1E4 chars: 2.0 ms
> 1E7 chars + 1E1 x 1E5 chars: 1.9 ms
>
> 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 2.0 ms
> 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 4.4 ms
>
> perl 5.12.1, usemymalloc=y, EGLIBC 2.11.2-2:
>
> 1E5 chars + 1E4 x 1E2 chars: 2.6 ms
> 1E6 chars + 1E4 x 1E2 chars: 3.8 ms
> 1E7 chars + 1E4 x 1E2 chars: 2.5 ms
>
> 1E7 chars + 1E5 x 1E1 chars: 18.8 ms
> 1E7 chars + 1E4 x 1E2 chars: 2.5 ms
> 1E7 chars + 1E3 x 1E3 chars: 0.9 ms
> 1E7 chars + 1E2 x 1E4 chars: 0.9 ms
> 1E7 chars + 1E1 x 1E5 chars: 1.1 ms
>
> 1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.9 ms
> 1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 3.4 ms

Yours,
Ilya

Peter J. Holzer

Jul 28, 2010, 4:15:22 AM
On 2010-07-28 01:02, Ilya Zakharevich <nospam...@ilyaz.org> wrote:
> On 2010-07-27, Peter J. Holzer <hjp-u...@hjp.at> wrote:
>> I'm pretty sure that GNU malloc doesn't round up to powers of two or
>> something like that. However, the performance difference between GNU
>> malloc and Perl malloc is rather small:
>
> Yeah, right. Not more than 3 times. ;-)

Compared to the factor of about 1000 that Wolfram and Ben observed this
is small. More importantly, growing a string to length n in constant
increments appears to be an O(n) operation with both your malloc and GNU
malloc, but O(n²) with Win32 malloc and BSD malloc.

Your malloc is certainly faster than GNU malloc. OTOH, AFAICS from
quickly scanning the comments in malloc.c, it always pads allocation to
the next power of two (minus 4) and it doesn't return unused memory back
to the OS. So it might use quite a bit more memory. Which of these
effects is more important depends on the program and is hard to
determine analytically. Which means that I should benchmark some of my
more performance-critical scripts with both allocators ;-).

hp

Wolfram Humann

Jul 28, 2010, 11:09:28 AM
On Jul 28, 2:50 am, Ilya Zakharevich <nospam-ab...@ilyaz.org> wrote:

> On 2010-07-27, Wolfram Humann <w.c.hum...@arcor.de> wrote:
>
> > sv.c). Perl_sv_grow then needs to decide if the string's memory is
> > already sufficient or really needs to grow. In the latter case,
> > safesysrealloc -> Perl_safesysrealloc -> realloc is called. The
> > interesting point is: how much memory does it request? The answer is:
>
> > newlen += 10 * (newlen - SvCUR(sv)); /* avoid copy each time */
>
> > I.e. it requests 10 times as much memory as is required for the
> > current append operation. So when I loop 10000 times and each time
> > append 100 chars to an initial string size of 10 million, the memory
> > grows from 10.000e6 to 10.001e6 to 10.002e6 and so on 1000 times till
> > it ends at 11.000e6.
>
> Good l*rd!
>
> The current algorithm is optimized to work in tandem with "my"
> malloc(), which would round up to a certain geometric progression
> anyway.  So if one use as different malloc()s, one should better use
>
>   newlen += (newlen >> 4) + 10; /* avoid copy each time */

I finally managed to compile my own win32 perl. (Actually it was quite
easy once I stopped making mistakes so stupid I do not dare to talk
about them...)
Now I could modify Perl_sv_grow() and insert debugging prints and I
found good and bad news.

The bad news: Looks like I was *overly optimistic* (LOL!) concerning
the efficiency of the current string memory allocation on win32. The
"newlen += 10 * (newlen - SvCUR(sv))" line is only executed if
SvOOK(sv) -- i.e. in most cases it is *not* executed. Therefore win32
system realloc is not called every tenth string-append operation but
*every* time something gets appended to a string.

The good news: A single additional line of code makes win32 perl
100...1000 times faster!
(for code that appends to strings very frequently)

I went with Ilya's proposal but inserted the line a little further
down, just after
    if (newlen > SvLEN(sv)) {           /* need more room? */

So now we have:

    if (newlen > SvLEN(sv)) {           /* need more room? */
        newlen += (newlen >> 2) + 10;
#ifndef Perl_safesysmalloc_size
        newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
        if (SvLEN(sv) && s) {
            s = (char*)saferealloc(s, newlen);
        }

The remaining question is by what ratio a string's memory should grow.
I tried several values from (newlen >> 0) to (newlen >> 6) for the
best compromise between execution time and memory usage and my
personal favorite is (newlen >> 2). What do others here think? At the
end of this post I will attach the results for my benchmark script
starting with Cygwin Perl followed by several versions of (newlen >>
x) and finally the unpatched Strawberry Perl. These reports now also
include memory footprint info (courtesy of pslist from the
Sysinternals suite). I also went back to my original task of reading a
12 MB postscript file using qx(cat ...) and in some cases I also
report times for that -- here Cygwin (70 ms) still beats my modified
perl (210 ms), but that's still waaaaay better than the original 18000
ms :-)

I will also report to p5p.

Wolfram

###########################################################

c:\cygwin\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 1.5 ms
1E6 chars + 1E4 x 1E2 chars: 2.3 ms


1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 12.2 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.6 ms
1E7 chars + 1E2 x 1E4 chars: 0.6 ms
1E7 chars + 1E1 x 1E5 chars: 0.8 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms

1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.9 ms

Private MB: 326.5
Peak Private MB: 326.5

--------------

qx(cat postscriptfile.ps): 68.7 ms
Private MB: 38.5
Peak Private MB: 38.5

###########################################################

newlen += (newlen >> 0) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 2.2 ms
1E6 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms

1E7 chars + 1E5 x 1E1 chars: 10.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.6 ms
1E7 chars + 1E2 x 1E4 chars: 0.6 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms

1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.0 ms

Private MB: 378.3
Peak Private MB: 418.0

--------------

qx(cat postscriptfile.ps): 181.2 ms
Private MB: 25.1
Peak Private MB: 40.3

###########################################################

newlen += (newlen >> 1) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 2.5 ms
1E6 chars + 1E4 x 1E2 chars: 2.4 ms
1E7 chars + 1E4 x 1E2 chars: 1.3 ms

1E7 chars + 1E5 x 1E1 chars: 9.6 ms
1E7 chars + 1E4 x 1E2 chars: 1.3 ms
1E7 chars + 1E3 x 1E3 chars: 0.7 ms
1E7 chars + 1E2 x 1E4 chars: 0.7 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.4 ms

Private MB: 290.2
Peak Private MB: 319.5

###########################################################

newlen += (newlen >> 2) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 9.2 ms
1E6 chars + 1E4 x 1E2 chars: 5.3 ms


1E7 chars + 1E4 x 1E2 chars: 1.5 ms

1E7 chars + 1E5 x 1E1 chars: 9.9 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.5 ms
1E7 chars + 1E2 x 1E4 chars: 0.5 ms
1E7 chars + 1E1 x 1E5 chars: 0.6 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.4 ms

Private MB: 244.9
Peak Private MB: 270.1

--------------

qx(cat postscriptfile.ps): 209.8 ms
Private MB: 16.2
Peak Private MB: 29.0

###########################################################

newlen += (newlen >> 3) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 12.1 ms
1E6 chars + 1E4 x 1E2 chars: 6.9 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms

1E7 chars + 1E5 x 1E1 chars: 10.3 ms
1E7 chars + 1E4 x 1E2 chars: 1.4 ms
1E7 chars + 1E3 x 1E3 chars: 0.5 ms
1E7 chars + 1E2 x 1E4 chars: 0.5 ms
1E7 chars + 1E1 x 1E5 chars: 0.5 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.6 ms

Private MB: 221.9
Peak Private MB: 244.3

###########################################################

newlen += (newlen >> 4) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 17.0 ms
1E6 chars + 1E4 x 1E2 chars: 13.8 ms
1E7 chars + 1E4 x 1E2 chars: 11.2 ms

1E7 chars + 1E5 x 1E1 chars: 19.4 ms
1E7 chars + 1E4 x 1E2 chars: 10.1 ms
1E7 chars + 1E3 x 1E3 chars: 10.9 ms
1E7 chars + 1E2 x 1E4 chars: 11.1 ms
1E7 chars + 1E1 x 1E5 chars: 11.0 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms

1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.3 ms

Private MB: 219.4
Peak Private MB: 233.8

--------------

qx(cat postscriptfile.ps): 312.0 ms
Private MB: 14.0
Peak Private MB: 25.8

###########################################################

newlen += (newlen >> 6) + 10;

C:\wh_fast_perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 57.7 ms
1E6 chars + 1E4 x 1E2 chars: 59.8 ms
1E7 chars + 1E4 x 1E2 chars: 67.9 ms

1E7 chars + 1E5 x 1E1 chars: 69.4 ms
1E7 chars + 1E4 x 1E2 chars: 71.6 ms
1E7 chars + 1E3 x 1E3 chars: 69.6 ms
1E7 chars + 1E2 x 1E4 chars: 64.8 ms
1E7 chars + 1E1 x 1E5 chars: 53.8 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.2 ms

1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 5.7 ms

Private MB: 219.8
Peak Private MB: 230.0

###########################################################

unpatched Strawberry Perl

c:\strawberry\perl\bin\perl d:\exe\LongStrings.pl

1E5 chars + 1E4 x 1E2 chars: 96.2 ms
1E6 chars + 1E4 x 1E2 chars: 325.7 ms
1E7 chars + 1E4 x 1E2 chars: 2655.9 ms

1E7 chars + 1E5 x 1E1 chars: 2687.3 ms
1E7 chars + 1E4 x 1E2 chars: 2687.4 ms
1E7 chars + 1E3 x 1E3 chars: 2656.1 ms
1E7 chars + 1E2 x 1E4 chars: 1093.6 ms
1E7 chars + 1E1 x 1E5 chars: 108.3 ms

1E7 chars (pre-extend to 2E7) + 1E4 x 1E2 chars: 1.1 ms
1E7 (1E5 x 1E2 chars) array + 1E4 x 1E2 chars : 6.1 ms

Private MB: 200.4
Peak Private MB: 210.2

--------------

qx(cat postscriptfile.ps): 18187.5 ms
Private MB: 13.2
Peak Private MB: 24.9

Peter J. Holzer

Jul 28, 2010, 3:38:07 PM
On 2010-07-28 15:09, Wolfram Humann <w.c.h...@arcor.de> wrote:
> I went with Ilya's proposal but inserted the line a little further
> down, just after
> if (newlen > SvLEN(sv)) { /* need more room? */
>
> So now we have:
> if (newlen > SvLEN(sv)) { /* need more room? */
> newlen += (newlen >> 2) + 10;
> #ifndef Perl_safesysmalloc_size
> newlen = PERL_STRLEN_ROUNDUP(newlen);
> #endif
> if (SvLEN(sv) && s) {
> s = (char*)saferealloc(s, newlen);
> }
>
> The remaining question is by what ratio a string's memory should grow.
> I tried several values from (newlen >> 0) to (newlen >> 6) for the
> best compromise between execution time and memory usage and my
> personal favorite is (newlen >> 2). What do others here think?

That sounds about right. I've used factors between 1.2 and 1.5 in the
past for similar problems. I suggest you base the growth on the old
size, though, something like:

    if (newlen > SvLEN(sv)) {           /* need more room? */
        size_t min = SvLEN(sv) * 5/4 + 10;
        if (newlen < min) newlen = min;
        ...

This gives you the same growth pattern if the increments are small, but
it doesn't allocate extra memory if you append a large chunk.

hp

Ilya Zakharevich

Jul 29, 2010, 12:30:08 AM
On 2010-07-28, Wolfram Humann <w.c.h...@arcor.de> wrote:
> So now we have:
> if (newlen > SvLEN(sv)) { /* need more room? */
> newlen += (newlen >> 2) + 10;
> #ifndef Perl_safesysmalloc_size
> newlen = PERL_STRLEN_ROUNDUP(newlen);
> #endif
> if (SvLEN(sv) && s) {
> s = (char*)saferealloc(s, newlen);
> }

I think you make your life too simple. What you must do is find the
chain of events which sets
Perl_safesysmalloc_size/PERL_STRLEN_ROUNDUP, and modify this chain. :-(

> I tried several values from (newlen >> 0) to (newlen >> 6) for the
> best compromise between execution time and memory usage and my
> personal favorite is (newlen >> 2). What do others here think?

My approach is never to take responsibility for such decisions.
Make a default value, and shift responsibility to the user. ;-)

#ifndef PERL_STRLEN_ROUNDUP_SHIFT
# define PERL_STRLEN_ROUNDUP_SHIFT 2
#endif

The suggestion to use the OLD length is also very viable...

> 1E5 chars + 1E4 x 1E2 chars: 1.5 ms

I hope your `ms' are actually seconds. It does not make sense to
measure performance on runs shorter than a second (maybe more on Win,
which is doing more unknown stuff in background)...

Yours,
Ilya

Wolfram Humann

Jul 29, 2010, 11:05:16 AM
On Jul 29, 6:30 am, Ilya Zakharevich <nospam-ab...@ilyaz.org> wrote:

> On 2010-07-28, Wolfram Humann <w.c.hum...@arcor.de> wrote:
>
> > So now we have:
> >     if (newlen > SvLEN(sv)) {           /* need more room? */
> >    newlen += (newlen >> 2) + 10;
> > #ifndef Perl_safesysmalloc_size
> >    newlen = PERL_STRLEN_ROUNDUP(newlen);
> > #endif
> >    if (SvLEN(sv) && s) {
> >        s = (char*)saferealloc(s, newlen);
> >    }
>
> I think you make your life too simple.  What you must do is find the
> chain of events which sets
> Perl_safesysmalloc_size/PERL_STRLEN_ROUNDUP, and modify this chain.  :-(

I'm traveling in territory that's fairly unknown to me, so if this is
too simple I should leave it to someone more knowledgeable to do it
right. What I *think* is that PERL_STRLEN_ROUNDUP is just concerned
with memory boundary alignment, e.g. rounding up to the next multiple
of 4. If that is the case, it should be independent of any string
memory expansion strategy.

>
> > I tried several values from (newlen >> 0) to (newlen >> 6) for the
> > best compromise between execution time and memory usage and my
> > personal favorite is (newlen >> 2). What do others here think?
>
> My approach is never to take responsibility for such decisions.
> Make a default value, and shift responsibility to the user.  ;-)
>
> #ifndef PERL_STRLEN_ROUNDUP_SHIFT
> #  define PERL_STRLEN_ROUNDUP_SHIFT 2
> #endif

Given that nobody stumbled across the devastating current state, I
don't envision hordes of users trying to optimize this. But I agree
that it serves well as a reminder for someone reading the code that
this is not *the* correct value but a trade-off and could be decided
differently. However, given that IMO it's a different concept from the
existing PERL_STRLEN_ROUNDUP, I would prefer to give it a different
name. How about PERL_STRLEN_EXPAND_SHIFT?

> The suggestion to use the OLD length is also very viable...

Agreed.

> > 1E5 chars + 1E4 x 1E2 chars:    1.5 ms
>
> I hope your `ms' are actually seconds.  It does not make sense to
> measure performance on runs shorter than a second (maybe more on Win,
> which is doing more unknown stuff in background)...

No, these are milliseconds. Yes, this makes it a lousy benchmark. Yes,
these numbers do vary easily by +-30% and sometimes more from run to
run. I tried to post "average" runs, but that's even more subjective,
of course. However the time differences encountered between different
cases are often a factor of 10 or even much more (in the unmodified
case, many of these do run for several seconds ;-)), so I think this
is a valid base for comparison.

So my current proposal reads like this:

#ifndef PERL_STRLEN_EXPAND_SHIFT
#  define PERL_STRLEN_EXPAND_SHIFT 2
#endif

    if (newlen > SvLEN(sv)) {           /* need more room? */
        size_t minlen = SvCUR(sv);
        minlen += (minlen >> PERL_STRLEN_EXPAND_SHIFT) + 10;
        if (newlen < minlen) newlen = minlen;
#ifndef Perl_safesysmalloc_size
        newlen = PERL_STRLEN_ROUNDUP(newlen);
#endif
        if (SvLEN(sv) && s) {
            s = (char*)saferealloc(s, newlen);
        }

My benchmark script does run slower now. The previous version did
expand allocated memory beyond minimum requirements during the initial
assignment, so that the reallocation count during append was 0 in many
cases. The new version does not do that so that at least 1 realloc is
required when one starts appending to the string.

Wolfram
