
Please help: perl run out of memory


wilson

Apr 17, 2022, 5:45:05 AM
to begi...@perl.org
Hello experts,

Can you help me check my script and suggest how to optimize it?
Currently it dies with "Out of memory":

$ perl count.pl
Out of memory!
Killed


My script:
use strict;

my %hash;
my %stat;

# dataset: userId, itemId, rate, time
# AV056ETQ5RXLN,0000031887,1.0,1397692800

open HD, "rate.csv" or die $!;
while (<HD>) {
    my ($item, $rate) = (split /,/)[1, 2];
    $hash{$item}{total} += $rate;
    $hash{$item}{count} += 1;
}
close HD;

for my $key (keys %hash) {
    $stat{$key} = $hash{$key}{total} / $hash{$key}{count};
}

my $i = 0;
for (sort { $stat{$b} <=> $stat{$a} } keys %stat) {
    print "$_: $stat{$_}\n";
    last if $i == 99;
    $i++;
}

The purpose is to aggregate and average the scores for each itemId, and
to print the results after sorting.

The dataset has 80+ million items:

$ wc -l rate.csv
82677131 rate.csv

And my memory is somewhat limited:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1992         152          76           0        1763        1700
Swap:          1023         802         221



What confuses me is that Apache Spark can get this job done with this
limited memory; it finished the statistics within 2 minutes. But I want
to give Perl a try, since it's not always convenient to run a Spark job.

The spark implementation:

scala> val schema="uid STRING,item STRING,rate FLOAT,time INT"
val schema: String = uid STRING,item STRING,rate FLOAT,time INT

scala> val df = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [uid: string, item: string ... 2 more fields]

scala> df.groupBy("item").agg(avg("rate").alias("avg_rate")).orderBy(desc("avg_rate")).show()
+----------+--------+
|      item|avg_rate|
+----------+--------+
|0001061100|     5.0|
|0001543849|     5.0|
|0001061127|     5.0|
|0001019880|     5.0|
|0001062395|     5.0|
|0000143502|     5.0|
|000014357X|     5.0|
|0001527665|     5.0|
|000107461X|     5.0|
|0000191639|     5.0|
|0001127748|     5.0|
|0000791156|     5.0|
|0001203088|     5.0|
|0001053744|     5.0|
|0001360183|     5.0|
|0001042335|     5.0|
|0001374400|     5.0|
|0001046810|     5.0|
|0001380877|     5.0|
|0001050230|     5.0|
+----------+--------+
only showing top 20 rows


I think it should be possible to optimize my Perl script to handle this
job as well, so I'm asking for your help.

Thanks in advance.

wilson

wilson

Apr 22, 2022, 12:45:05 AM
to begi...@perl.org
Yes, the script is suitable for a small dataset. I have posted an update
with another statistics job on a smaller dataset; please check:
https://bigcount.xyz/script-and-spark-for-small-dataset.html

regards

David Precious wrote:
> Given that the OP is running into memory issues processing an 80+
> million line file, I don't think suggesting a CPAN module designed to
> read the entire contents of a file into memory is going to be very
> helpful

David Precious

Apr 22, 2022, 5:15:06 AM
to begi...@perl.org
On Thu, 21 Apr 2022 07:12:07 -0700
al...@coakmail.com wrote:

> The OP may need streaming IO for reading files.

Which is what they were already doing - they used:

while (<HD>) {
...
}

Which, under the hood, uses readline to read a line at a time.

(where "HD" is their global filehandle - a lexical filehandle would
have been better, but makes no difference here)
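
(For reference, a minimal sketch of the lexical, three-argument form -
same line-at-a-time behaviour, just tidier; the filename is the OP's
rate.csv:)

use strict;
use warnings;

# Same streaming read, but with a lexical filehandle and a
# three-argument open:
open my $fh, '<', 'rate.csv' or die "Cannot open rate.csv: $!";
while (my $line = <$fh>) {
    my ($item, $rate) = (split /,/, $line)[1, 2];
    # ... aggregate $item / $rate as in the original script ...
}
close $fh;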


You can use B::Deparse to see that a while (<...>) loop deparses to a
use of readline:


[davidp@columbia:~]$ cat tmp/readline
#!/usr/bin/env perl

while (<STDIN>) {
    print "Line: $_\n";
}

[davidp@columbia:~]$ perl -MO=Deparse tmp/readline
while (defined($_ = readline STDIN)) {
    print "Line: $_\n";
}
tmp/readline syntax OK


So, they're already reading line-wise; it seems they're just
running into memory usage issues from holding a hash of 80+ million
values, which is not super surprising on a reasonably low-memory box.
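
As a rough illustration - assuming the Devel::Size module from CPAN is
installed, and using made-up item IDs - you can estimate the footprint
of a nested hash shaped like theirs:

#!/usr/bin/env perl
# Sketch only: build a nested hash shaped like the one in count.pl and
# report its size. Item IDs are fake; assumes Devel::Size is installed.
use strict;
use warnings;
use Devel::Size qw(total_size);

my %hash;
for my $i (1 .. 100_000) {
    my $item = sprintf "%010d", $i;      # fake 10-digit item ID
    $hash{$item}{total} += rand() * 5;
    $hash{$item}{count} += 1;
}
printf "100k items: %.1f MB\n", total_size(\%hash) / (1024 * 1024);
# Scale by (number of unique items / 100k) for a ballpark of the real job.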

Personally, if it were me, I'd go one of two ways:

* just throw some more RAM at it - these days, RAM is far cheaper than
programmer time spent trying to squeeze the data into the fewest
possible bytes, especially if it's a quick "Get It Done" solution

* hand it off to a tool made for the job - import the data into SQLite
or some other DB engine and let it do what it's designed for, as it's
likely to be far more efficient than a hand-rolled Perl solution.
(They already proved that Apache Spark can handle it on the same
hardware.) A rough sketch of the SQLite route follows below.
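
For the second route, a minimal sketch using DBI with DBD::SQLite
(assuming both are installed; the database file, table and column names
below are just examples, not anything the OP already has):

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Sketch only: load the CSV into SQLite, then let SQL do the work.
my $dbh = DBI->connect("dbi:SQLite:dbname=rate.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS rate (item TEXT, rate REAL)");

my $ins = $dbh->prepare("INSERT INTO rate (item, rate) VALUES (?, ?)");
open my $fh, '<', 'rate.csv' or die "Cannot open rate.csv: $!";
while (my $line = <$fh>) {
    my ($item, $rate) = (split /,/, $line)[1, 2];
    $ins->execute($item, $rate);
}
close $fh;
$dbh->commit;    # one big transaction; commit in batches for 80M+ rows

# Aggregate, average and sort on disk rather than in RAM.
my $sth = $dbh->prepare(
    "SELECT item, AVG(rate) AS avg_rate
       FROM rate
      GROUP BY item
      ORDER BY avg_rate DESC
      LIMIT 100"
);
$sth->execute;
while (my ($item, $avg) = $sth->fetchrow_array) {
    print "$item: $avg\n";
}
$dbh->disconnect;

SQLite can spill the GROUP BY / ORDER BY work to temporary storage on
disk, so it isn't constrained by RAM the way the in-memory hash is.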
