[POD follows]
=head2 How can I make my Perl program take less memory?
When it comes to time-space tradeoffs, Perl nearly always prefers to
throw memory at a problem. Scalars in Perl use more memory than strings
in C, arrays take more than that, and hashes use even more. While
there's still a lot to be done, recent releases have been addressing
these issues. For example, as of 5.004, duplicate hash keys are shared
amongst all hashes using them, so require no reallocation.
In some cases, using substr() or vec() to simulate arrays can be
highly beneficial. For example, an array of a thousand booleans will
take at least 20,000 bytes of space, but it can be turned into one
125-byte bit vector for a considerable memory savings. The standard
Tie::SubstrHash module can also help for certain types of data
structure. If you're working with specialist data structures
(matrices, for instance) modules that implement these in C may use
less memory than equivalent Perl modules.
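
The bit-vector idea above can be sketched as follows. This is a minimal illustration, not a benchmark: the variable names and the "every third flag" pattern are made up for the example, and only the built-in vec(), unpack() and length() functions are used.

```perl
use strict;
use warnings;

# Pack 1,000 boolean flags into a bit vector instead of a 1,000-element array.
my $flags = '';
my $count = 1000;

# Pre-extend the string by writing to the highest bit (writing 0 still extends).
vec($flags, $count - 1, 1) = 0;

# Set every third flag (an arbitrary pattern, just for illustration).
for my $i (0 .. $count - 1) {
    vec($flags, $i, 1) = 1 if $i % 3 == 0;
}

# Read a flag back: vec() returns 0 or 1 for a 1-bit field.
my $is_set = vec($flags, 300, 1);

# Count the set bits with unpack's checksum form.
my $set_count = unpack('%32b*', $flags);

printf "%d bytes hold %d flags, %d of them set\n",
    length($flags), $count, $set_count;
```

The trade-off is readability and speed of access: every read or write goes through vec() rather than a plain array subscript.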
Another thing to try is learning whether your Perl was compiled with
the system malloc or with Perl's builtin malloc. Whichever one it
is, try using the other one and see whether this makes a difference.
Information about malloc is in the F<INSTALL> file in the source
distribution. You can find out whether you are using perl's malloc by
typing C<perl -V:usemymalloc>.
Of course, the best way to save memory is to not do anything to waste
it in the first place. Good programming practices can go a long way
toward this:
=over 4
=item * Don't slurp!
Don't read an entire file into memory if you can process it line
by line. Whenever possible, use this:
    while (<FILE>) {
        # ...
    }
instead of this:
    @data = <FILE>;
    foreach (@data) {
        # ...
    }
When the files you're processing are small, it doesn't much matter which
way you do it, but it makes a huge difference when they start getting
larger. The latter method keeps eating up more and more memory, while
the former method scales to files of any size.
If you do need the whole file in memory, read it directly into the data
structure where it will be used; that way you don't have multiple copies
of data clogging up RAM.
=item * Localize!
Don't make anything global that doesn't have to be. Use my() prodigously
to localize variables to the smallest possible scope. Memory freed by
variables that have gone out of scope can be reused elsewhere,
preventing the need for additional allocations.
=item * Pass by reference
Pass arrays and hashes by reference, not by value. For one thing, it's
the only way to pass multiple lists or hashes (or both) in a single
call/return. It also avoids creating a copy of all the contents. This
requires some judgement, however, because any changes will be propagated
back to the original data. If you really want to mangle (er, modify) a
copy, you'll have to sacrifice the memory needed to make one.
=item * Tie large variables to disk.
For "big" data stores (i.e. ones that exceed available memory) consider
using one of the DB modules to store it on disk instead of in RAM. This
will incur a penalty in access time, but that's probably better than
causing your hard disk to thrash due to massive swapping.
=back
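
The "tie large variables to disk" item can be sketched roughly like this. The text only says "one of the DB modules"; SDBM_File is picked here purely because it is bundled with perl, and it has a small per-entry size limit, so DB_File or similar may suit real data better. The scratch directory and the key/value names are assumptions made for the example.

```perl
use strict;
use warnings;
use Fcntl;                   # O_RDWR, O_CREAT
use SDBM_File;               # DBM implementation bundled with perl
use File::Temp qw(tempdir);

# Hypothetical location: a scratch directory that is removed on exit.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/bigdata";

# Tie the hash to the on-disk database; entries live in the file, not in RAM.
tie(my %store, 'SDBM_File', $path, O_RDWR | O_CREAT, 0666)
    or die "Cannot tie $path: $!";

# Ordinary hash syntax now reads and writes the disk file.
$store{"key$_"} = "value $_" for 1 .. 1000;

my $value = $store{key42};
my $count = 0;

# each() walks the file one pair at a time; keys() would build the full list.
while (my ($k, $v) = each %store) {
    $count++;
}

print "key42 => $value; $count entries on disk\n";

untie %store;
```

Iterating with each() rather than keys() matters here: building the complete key list in memory would defeat the purpose of moving the data to disk.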
-mjc
I've added your changes to my copy of the perl-5.6.1-TRIAL2 perlfaq3.pod
with the following very minor change. I'll post the changes to perl5porters
and the pumpking if there is no further discussion.
diff -c faqaddition.orig faqaddition
*** faqaddition.orig Mon Mar 5 12:37:26 2001
--- faqaddition Mon Mar 5 12:36:25 2001
***************
*** 56,65 ****
=item * Localize!
! Don't make anything global that doesn't have to be. Use my() prodigously
! to localize variables to the smallest possible scope. Memory freed by
! variables that have gone out of scope can be reused elsewhere,
! preventing the need for additional allocations.
=item * Pass by reference
--- 56,66 ----
=item * Localize!
! Don't make anything global that doesn't have to be. Use my()
! prodigiously to localize variables to the smallest possible scope.
! Memory freed by variables that have gone out of scope can be reused
! elsewhere in the current program, preventing the need for additional
! allocations from system memory.
=item * Pass by reference
***************
*** 78,81 ****
--- 79,83 ----
causing your hard disk to thrash due to massive swapping.
=back
+
--
This space intentionally left blank
> ! Don't make anything global that doesn't have to be. Use my()
> ! prodigiously to localize variables to the smallest possible scope.
> ! Memory freed by variables that have gone out of scope can be reused
> ! elsewhere in the current program, preventing the need for additional
> ! allocations from system memory.
This ignores the fact that memory used by locals *is* reused, but
memory used by lexicals *is not*.
But I'm not surprized. Perl's FAQ is much more a political document
than a reliable document....
Hope this helps,
Ilya
Whoa there! That's news to me.
So if I do this:
SOME_BLOCK: {
my @array = (0 .. 1000);
# ...
}
and no references to @array exist outside the block, the memory
allocated to it still won't be freed up for use elsewhere in the program
after I leave the block?
> But I'm not surprized. Perl's FAQ is much more a political document
> than a reliable document....
Apparently, because I've read that memory is reused in other parts of
the FAQ. I thought that I'd seen it elsewhere as well, but as I can't
find a book reference to it right now I may just be recalling reading it
here.
If this is true, then I want to strike that addition. (Okay, Chris?)
Localizing is still good advice, of course, but I'd hate to recommend it
for reasons that are inaccurate.
Side note: Admittedly, my "Pass by reference" addition is a little fuzzy
as well, but I was trying to keep things fairly short and didn't want to
get into all the details and subtle points.
-mjc
On the issue of memory, you should be careful when using map or grep.
This may not be a problem anymore, but in earlier versions (probably 5.003
or 5.004), it appeared that map and grep would potentially cause an entire
file to be slurped.
I forget the exact reason I tested this and the exact syntax of my tests,
but from memory they were similar to these examples
-1- @lines_wanted = grep {m/whatever/} <FILE> ;
-2- while (<FILE>) { push @lines_wanted , $_ if m/whatever/ }
I could bring my machine to a grinding halt by running -1- on very large
files. It appeared from the OS memory stats that the entire file must
have been slurped. -2- had no such problem.
This was probably on 5.003 or 5.004.
Michael Carman <mjca...@home.com> writes:
[...]
>
> =item * Don't slurp!
>
> Don't read an entire file into memory if you can process it line
> by line. Whenever possible, use this:
>
> while (<FILE>) {
> # ...
> }
>
> instead of this:
>
> @data = <FILE>;
> foreach (@data) {
> # ...
> }
and B<never> use this:
for (<FILE>) {
# ...
}
[...]
> =item * Localize!
>
> Don't make anything global that doesn't have to be. Use my() prodigously
> to localize variables to the smallest possible scope. Memory freed by
> variables that have gone out of scope can be reused elsewhere,
> preventing the need for additional allocations.
=item * Avoid unnecessary quotes and stringification
Don't quote large strings unless absolutely necessary:
    my $copy = "$large_string";
makes 2 copies of $large_string (one for $copy and another for
the quotes), whereas
    my $copy = $large_string;
only makes one copy.
Ditto for stringifying large arrays:
    {
        local $, = "\n";
        print @big_array;
    }
is much more memory-efficient than either
    print join "\n", @big_array;
or
    {
        local $" = "\n";
        print "@big_array";
    }
If you need to initialize a large variable in your code, you
might consider doing it with an eval statement like this:
my $large_string = eval ' "a" x 5_000_000 ';
This allows perl to immediately free the memory allocated to the
eval statement, but carries a (small) performance penalty.
> =item * Pass by reference
>
> Pass arrays and hashes by reference, not by value. For one thing, it's
> the only way
(sans prototyping)
> to pass multiple lists or hashes (or both) in a single call/return. It
> also avoids creating a copy of all the contents.
<correction>
Array elements are passed by reference, not copied (like hash entries
are). The differences between
(A) foo(\@array)
and
(B) foo(@array)
are
1> (A) avoids (B)'s overhead of setting up the aliases in @_,
2> (A) can manipulate @array itself, (e.g. push, pop, $#array, etc.),
whereas (B) can only modify the values of the elements in @array.
There's probably more, that's all I can think of off the top of my head.
I think you should rework this section a bit.
</correction>
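
A small sketch of the distinction drawn in the correction above. The subroutine names are invented for illustration; the behaviour shown (the elements of @_ are aliases to the caller's values, and copying happens only when @_ is assigned to another array) is standard Perl.

```perl
use strict;
use warnings;

# Receives aliases to the caller's elements in @_; no element is copied.
sub double_in_place {
    $_ *= 2 for @_;          # changes the caller's values
    return;
}

# Copying @_ into a lexical array is what actually duplicates the data.
sub sum_of_copy {
    my @copy = @_;
    my $sum  = 0;
    $sum += $_ for @copy;
    return $sum;
}

# Receives one reference; can change the array itself, not just its values.
sub append_total {
    my ($aref) = @_;
    my $sum = 0;
    $sum += $_ for @$aref;
    push @$aref, $sum;
    return;
}

my @numbers = (1, 2, 3);
double_in_place(@numbers);        # case (B): element values are modified
my $sum = sum_of_copy(@numbers);  # the copy is made inside the sub
append_total(\@numbers);          # case (A): the array itself grows

print "@numbers (sum of copy: $sum)\n";
```

So the memory cost of foo(@array) is the list of aliases, not a second copy of the data; the copy only appears if the subroutine does something like my @copy = @_.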
> This requires some judgement, however, because any changes will be
> propagated back to the original data. If you really want to mangle
> (er, modify) a copy, you'll have to sacrifice the memory needed to
> make one.
... If your copy consumes a large amount of RAM, you may want
to explicitly undef() your copy once you no longer need it. Perl
might then return the additional memory back to the OS.
[...]
Otherwise it looks good to me.
HTH
--
Joe Schaefer
Good point. It still does this in 5.6, which makes perfect sense, as map
and grep both expect a list. It would require some special magic to make
this loop over <FILE> instead of slurping it.
-mjc
Since map and grep have no relationship to files, this is meaningless.
> -1- @lines_wanted = grep {m/whatever/} <FILE> ;
Here <FILE> is in an array context.
Hope this helps,
Ilya
: On the issue of memory, you should be careful when using map or grep.
: This may not be a problem anymore, but in earlier versions (probably 5.003
: or 5.004), it appeared that map and grep would potentially cause an entire
: file to be slurped.
: I forget the exact reason I tested this and the exact syntax of my tests,
: but from memory they were similar to these examples
: -1- @lines_wanted = grep {m/whatever/} <FILE> ;
: -2- while (<FILE>) { push @lines_wanted , $_ if m/whatever/ }
: I could bring my machine to a grinding halt by running -1- on very large
: files. It appeared from the OS memory stats that the entire file must
: have been slurped. -2- had no such problem.
: This was probably on 5.003 or 5.004.
It has been pointed out to me that the above behaviour has nothing to do
with map or grep. It is because <FILE> is used in an array context, and
is therefore simply a slurp.
However, the point still remains.
Whereas a shell script might sensibly use something like this...
E.g.
make_data | grep condition | cut columns > newfile
which doesn't need much memory,
the intuitively equivalent perl
E.g.
@newlines = map {something} grep {condition} <FILE> ;
may not be a good idea because it may use lots of memory.
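
A streaming version of that map/grep chain might look like the sketch below. The in-memory filehandles stand in for real files, and the condition (/5/) and the transformation (s/^line/match/) are placeholders for the "condition" and "something" above.

```perl
use strict;
use warnings;

# Stand-in for a large input file: an in-memory filehandle (illustration only).
my $input = join '', map { "line $_\n" } 1 .. 20;
open(my $in, '<', \$input) or die "Cannot open input: $!";

# Stand-in for the output file.
my $output = '';
open(my $out, '>', \$output) or die "Cannot open output: $!";

# One line is held at a time: filter (the grep step), transform (the map
# step), and write it out immediately instead of collecting a list.
while (my $line = <$in>) {
    next unless $line =~ /5/;        # condition
    $line =~ s/^line/match/;         # something
    print {$out} $line;
}

close $out or die "Cannot close output: $!";
close $in;

print $output;
```

Like the shell pipeline, this keeps only one record in memory at any moment, at the cost of being a little wordier than the one-line map/grep form.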
I should point out that there's no such thing as array context in
perl--it's either scalar or list. (Ilya mis-spoke when he used
"array context" earlier)
Dan
Based on the comments I've received so far (thanks!), here's the current
revision:
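=over 4
=item * Don't slurp!
Don't read an entire file into memory if you can process it line
by line. Whenever possible, use this:
    while (<FILE>) {
        # ...
    }
instead of this:
    @data = <FILE>;
    foreach (@data) {
        # ...
    }
and B<never> use this:
    for (<FILE>) {
        # ...
    }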
When the files you're processing are small, it doesn't much matter which
way you do it, but it makes a huge difference when they start getting
larger. The latter method keeps eating up more and more memory, while
the former method scales to files of any size.
If you do need the whole file in memory, read it directly into the data
structure where it will be used; that way you don't have multiple copies
of data clogging up RAM.
=item * Use map and grep selectively
Remember that both map and grep expect a LIST argument, so doing this:
    @wanted = grep {/pattern/} <FILE>;
will cause the entire file to be slurped. For large files, it's better
to loop:
    while (<FILE>) {
        push(@wanted, $_) if /pattern/;
    }
=item * Avoid unnecessary quotes and stringification
Don't quote large strings unless absolutely necessary:
    my $copy = "$large_string";
makes 2 copies of $large_string (one for $copy and another for the
quotes), whereas
    my $copy = $large_string;
only makes one copy.
Ditto for stringifying large arrays:
    {
        local $, = "\n";
        print @big_array;
    }
is much more memory-efficient than either
    print join "\n", @big_array;
or
    {
        local $" = "\n";
        print "@big_array";
    }
=item * Consider using C<eval BLOCK>
If you need to initialize a large variable in your code, you
might consider doing it with an eval statement like this:
my $large_string = eval ' "a" x 5_000_000 ';
This allows perl to immediately free the memory allocated to the
eval statement, but carries a (small) performance penalty.
=item * Pass by reference
Pass arrays and hashes by reference. Perl always passes references, but
calling
foo(@array);
passes a reference to I<each element> of @array, whereas calling
foo(\@array);
passes only one reference to @array itself. This requires some
judgement, however, because any changes will be propagated back to the
original data. If you really want to mangle (er, modify) a copy, you'll
have to sacrifice the memory needed to make one.
Note: This is also the only way (sans prototyping) to pass multiple
lists and/or hashes in a single call.
=item * Tie large variables to disk.
For "big" data stores (i.e. ones that exceed available memory) consider
using one of the DB modules to store it on disk instead of in RAM. This
will incur a penalty in access time, but that's probably better than
causing your hard disk to thrash due to massive swapping.
=item * Clean out the trash
If you have a variable which consumes a large amount of RAM, you may
want to explicitly undef() it once it's no longer needed. Perl might then
return the additional memory back to the OS.
=back
Does that really hit memory harder than the explicit slurp, or is it
just ugly?
> =item * Avoid unnecessary quotes and stringification
> [...]
> Ditto for stringifying large arrays:
>
Interesting; I'd never thought about that before.
> If you need to initialize a large variable in your code, you
> might consider doing it with an eval statement like this:
>
> my $large_string = eval ' "a" x 5_000_000 ';
>
> This allows perl to immediately free the memory allocated to the
> eval statement, but carries a (small) performance penalty.
You're just full of ideas, aren't you?
>> =item * Pass by reference
>>
>> Pass arrays and hashes by reference, not by value. For one thing, it's
>> the only way
>
> (sans prototyping)
Noted.
>> to pass multiple lists or hashes (or both) in a single call/return. It
>> also avoids creating a copy of all the contents.
>
> <correction>
>
> Array elements are passed by reference, not copied
> [...]
> I think you should rework this section a bit.
Yes, I know, but I was trying to avoid having so much detail that when a
newbie checks the docs (hooray!) they can't see the forest for the
trees.
I agree it needs a little work, and will try to come up with something
better. (Accurate but concise.)
> ... If your copy consumes a large amount of RAM, you may want
> to explicitly undef() your copy once you no longer need it. Perl
> might then return the additional memory back to the OS.
Hmm. I'd agree with you if not for what Ilya said in another branch of
this thread. It *should* help, but will it?
Thanks for your comments.
-mjc
> Joe Schaefer wrote:
> >
> > Michael Carman <mjca...@home.com> writes:
> >>
> >> =item * Don't slurp!
> >>
> >> [...]
> >
> > and B<never> use this:
> >
> > for (<FILE>) {
> > # ...
> > }
>
> Does that really hit memory harder than the explicit slurp, or is it
> just ugly?
Dunno, but it's certainly worse than
while (<FILE>) {
#...
}
which is what you recommended to use. I can't think of a
situation where "for(<FILE>)" would be reasonable, yet I've
seen it appear in clp.misc on occasion. I just thought an
explicit "don't do this" was warranted for the FAQ.
[...]
> > ... If your copy consumes a large amount of RAM, you may want
> > to explicitly undef() your copy once you no longer need it. Perl
> > might then return the additional memory back to the OS.
>
> Hmm. I'd agree with you if not for what Ilya said in another branch of
> this thread. It *should* help, but will it?
If I understood Ilya correctly, he was discussing perl's internal reuse
(or lack thereof) for memory allocated to lexicals. undef()'ing a
_large_ variable usually (1) causes perl to return the memory to the OS.
Generally I don't think there's anything to be gained (2) by undef()ing
lots of "normal-sized" variables, and it's certainly not a very Perl-ish
thing to do.
(1) - anecdotal to be sure, but it works for me on linux with 5.005_03
or better. There was a thread about a month ago where Jerome Abela and
I were trying to flesh this out via trial and error, and most of the
remarks I made here were based on that discussion:
Of course, an expert like Ilya who is intimately familiar with the gc
could certainly do a better job than I did.
(2) a Silvio Dante-ism
Best.
--
Joe Schaefer
Hmm.. surely that's C<eval EXPR> rather than C<eval BLOCK>?
--
Ilmari Karonen - http://www.sci.fi/~iltzu/
Oops, I said BLOCK because I was thinking of the compile-at-compile-time
form rather than the compile-at-runtime one; never mind what the
example showed. :) Since one probably can't generalize and say that the
BLOCK form is always feasible, I'll change the header to say simply
C<eval>.
-mjc
Well, in terms of ugliness,
for my $x ( <> )
is considerably better than
while ( defined( $x = <> ) )
--
John Porter
Useless use of time in void context.
> Don't read an entire file into memory if you can process it line
> by line. Whenever possible, use this:
>
> while (<FILE>) {
> # ...
> }
>
> instead of this:
>
> @data = <FILE>;
> foreach (@data) {
> # ...
> }
>
> and B<never> use this:
>
> for (<FILE>) {
> # ...
> }
>
> When the files you're processing are small, it doesn't much matter
> which way you do it, but it makes a huge difference when they start
> getting larger. The latter method keeps eating up more and more
> memory, while the former method scales to files of any size.
Does it make sense to use the terms `former' and `latter' when there
are three items? What about the middle item, then?
kai
--
Be indiscrete. Do it continuously.
According to Kai Großjohann <Kai.Gro...@CS.Uni-Dortmund.DE>:
Well, there are only two cases here, really. The first example
avoids reading all of the file at once, the other two do. The only
difference is that the middle example *may* be justified if something
else is done with @array later (and it had better be non-sequential).
The third approach is indefensible because it reads the file into an
anonymous array and then goes over it sequentially. There's no way
the array can be put to any use, it's just wasted space.
Anno
Uh, this comes as a surprise to me. I had expected perl to optimize
the first form to do the same as the second form.
>=item * Consider using C<eval BLOCK>
>
>If you need to initialize a large variable in your code, you
>might consider doing it with an eval statement like this:
>
> my $large_string = eval ' "a" x 5_000_000 ';
>
>This allows perl to immediately free the memory allocated to the
>eval statement, but carries a (small) performance penalty.
Sorry, I just don't get it: which "memory allocated to the eval"?
And also, without the eval there is no such memory allocated in the
first place, since there is no eval, right? So, what does
my $large_string = "a" x 5_000_000;
allocate that the eval version frees?
Otherwise I like the text.
--
Heiner Marxen hei...@drb.insel.de http://www.drb.insel.de/~heiner/
>>=item * Consider using C<eval BLOCK>
>>
>>If you need to initialize a large variable in your code, you
>>might consider doing it with an eval statement like this:
>>
>> my $large_string = eval ' "a" x 5_000_000 ';
>>
>>This allows perl to immediately free the memory allocated to the
>>eval statement, but carries a (small) performance penalty.
> Sorry, I just don't get it: which "memory allocated to the eval"?
The memory used to create the large string by concatenating the
q<"a"> five million times.
> And also, without the eval there is no such memory allocated in the
> first place, since there is no eval, right? So, what does
The memory is allocated, but to the main program instead of the
eval. I think it therefore does not get released as soon.
> my $large_string = "a" x 5_000_000;
> allocate that the eval version frees?
They both allocate the same (except for the overhead of the eval).
The difference is when the memory gets freed.
Chris
--
Christopher E. Stith