Please help: perl run out of memory

Apr 17, 2022, 5:45:05 AM
Hello experts,

Can you help me check my script and suggest how to optimize it?
Currently it fails with "Out of memory!".

$ perl
Out of memory!

My script:
use strict;

my %hash;
my %stat;

# dataset: userId, itemId, rate, time
# AV056ETQ5RXLN,0000031887,1.0,1397692800

open HD,"rate.csv" or die $!;
while(<HD>) {
    my ($item,$rate) = (split /\,/)[1,2];
    $hash{$item}{total} += $rate;
    $hash{$item}{count} +=1;
}
close HD;

for my $key (keys %hash) {
    $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
}

my $i = 0;
for (sort { $stat{$b} <=> $stat{$a}} keys %stat) {
    print "$_: $stat{$_}\n";
    last if $i == 99;
    $i ++;
}

The purpose is to aggregate and average the itemId's scores, and print
the result after sorting.

The dataset has 80+ million lines:

$ wc -l rate.csv
82677131 rate.csv

And my memory is somewhat limited:

$ free -m
              total        used        free      shared  buff/cache
Mem:           1992         152          76           0        1763
Swap:          1023         802         221

What confuses me is that Apache Spark can get this job done with the
same limited memory. It finished the statistics within 2 minutes. But I
want to give Perl a try, since it's not always convenient to run a Spark job.

The spark implementation:

scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
val schema: String = uid STRING,item STRING,rate FLOAT,time INT

scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ...
2 more fields]


+----------+--------+
|      item|avg_rate|
+----------+--------+
|0001061100|     5.0|
|0001543849|     5.0|
|0001061127|     5.0|
|0001019880|     5.0|
|0001062395|     5.0|
|0000143502|     5.0|
|000014357X|     5.0|
|0001527665|     5.0|
|000107461X|     5.0|
|0000191639|     5.0|
|0001127748|     5.0|
|0000791156|     5.0|
|0001203088|     5.0|
|0001053744|     5.0|
|0001360183|     5.0|
|0001042335|     5.0|
|0001374400|     5.0|
|0001046810|     5.0|
|0001380877|     5.0|
|0001050230|     5.0|
+----------+--------+
only showing top 20 rows

I think it should be possible to optimize my Perl script to run this
job as well, so I am asking for your help.

Thanks in advance.



Apr 22, 2022, 12:45:05 AM
Yes, the script is suitable for a small dataset.
I have posted an update with another statistics job on a smaller dataset,
please check:


David Precious wrote:
> Given that the OP is running into memory issues processing an 80+
> million line file, I don't think suggesting a CPAN module designed to
> read the entire contents of a file into memory is going to be very
> helpful

David Precious

Apr 22, 2022, 5:15:06 AM
On Thu, 21 Apr 2022 07:12:07 -0700 wrote:

> OP maybe need the streaming IO for reading files.

Which is what they were already doing - they used:

while (<HD>) {

Which, under the hood, uses readline, to read a line at a time.

(where "HD" is their global filehandle - a lexical filehandle would
have been better, but makes no difference here)

You can use B::Deparse to see that the above deparses to a use of
readline:
[davidp@columbia:~]$ cat tmp/readline
#!/usr/bin/env perl

while (<STDIN>) {
    print "Line: $_\n";
}

[davidp@columbia:~]$ perl -MO=Deparse tmp/readline
while (defined($_ = readline STDIN)) {
    print "Line: $_\n";
}
tmp/readline syntax OK

So, they're already reading line-wise; it seems they're just
running into memory usage issues from holding a hash of 80+ million
values, which is not super surprising on a reasonably low-memory box.
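
To illustrate where the memory goes, here is a minimal sketch of a leaner
in-Perl variant. It is only an assumption that this would fit on the OP's
box, since that depends on how many distinct item IDs the file contains.
The sub names aggregate and top_n are made up for this example. It keeps
two flat hashes instead of one anonymous hash per item, skips the separate
hash of averages, and tracks only the current top N rather than sorting
every key:

```perl
use strict;
use warnings;

# Read "userId,itemId,rate,time" lines from a filehandle and return
# references to two flat hashes: per-item rating totals and counts.
sub aggregate {
    my ($fh) = @_;
    my (%total, %count);
    while (my $line = <$fh>) {
        chomp $line;
        my ($item, $rate) = (split /,/, $line)[1, 2];
        next unless defined $rate;    # skip malformed lines
        $total{$item} += $rate;
        $count{$item}++;
    }
    return (\%total, \%count);
}

# Return up to $n [item, average] pairs, highest average first.
# At most $n + 1 candidates are held at any time.
sub top_n {
    my ($total, $count, $n) = @_;
    my @top;
    while (my ($item, $sum) = each %$total) {
        my $avg = $sum / $count->{$item};
        next if @top == $n && $avg <= $top[-1][1];
        push @top, [$item, $avg];
        @top = sort { $b->[1] <=> $a->[1] } @top;
        pop @top if @top > $n;
    }
    return @top;
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my ($total, $count) = aggregate($fh);
    close $fh;
    printf "%s: %s\n", @$_ for top_n($total, $count, 100);
}
```

Saved under a hypothetical name such as avg.pl, it would be run as
"perl avg.pl rate.csv". Items with equal averages come out in no
particular order, as in the original script.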

Personally, if it were me, I'd go one of two ways:

* just throw some more RAM at it - these days, RAM is far cheaper than
programmer time trying to squeeze it into the least amount of bytes,
especially true if it's a quick "Get It Done" solution

* hand it off to a tool made for the job - import the data into SQLite
or some other DB engine and let it do what it's designed for, as it's
likely to be far more efficient than a hand-rolled Perl solution.
(They already proved that Apache Spark can handle it on the same
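
For the SQLite route, a rough sketch using DBI with DBD::SQLite (both from
CPAN, so this assumes they are installed). The table name rate, the file
name rate.db and the sub names load_ratings and top_items are all
hypothetical choices for this example. The data lives on disk and SQLite
does the grouping, so Perl never holds the per-item totals itself:

```perl
use strict;
use warnings;
use DBI;    # assumes DBI and DBD::SQLite are installed

# Insert the itemId and rate columns of each CSV line into SQLite.
sub load_ratings {
    my ($dbh, $fh) = @_;
    $dbh->do('CREATE TABLE IF NOT EXISTS rate (item TEXT, rate REAL)');
    my $sth = $dbh->prepare('INSERT INTO rate (item, rate) VALUES (?, ?)');
    $dbh->begin_work;    # one transaction, far faster than autocommit
    while (my $line = <$fh>) {
        chomp $line;
        my ($item, $rate) = (split /,/, $line)[1, 2];
        next unless defined $rate;    # skip malformed lines
        $sth->execute($item, $rate);
    }
    $dbh->commit;
    return;
}

# Return an array ref of [item, avg_rate] rows, highest average first.
sub top_items {
    my ($dbh, $n) = @_;
    return $dbh->selectall_arrayref(
        'SELECT item, AVG(rate) AS avg_rate FROM rate
         GROUP BY item ORDER BY avg_rate DESC LIMIT ?',
        undef, $n,
    );
}

# Run only when a file name is given on the command line.
if (@ARGV) {
    my $dbh = DBI->connect('dbi:SQLite:dbname=rate.db', '', '',
        { RaiseError => 1, AutoCommit => 1 });
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    load_ratings($dbh, $fh);
    close $fh;
    printf "%s: %s\n", @$_ for @{ top_items($dbh, 100) };
    $dbh->disconnect;
}
```

The one-off import takes time and disk space, but once the data is
loaded, other statistics jobs become plain SQL queries against rate.db.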
