Frequency in large datasets

Cosmic Cruizer

Apr 30, 2008, 10:15:51 PM

I've been able to reduce my dataset by 75%, but it still leaves me with a
file of 47 gigs. I'm trying to find the frequency of each line using:

open(TEMP, "< $tempfile") || die "cannot open file $tempfile: $!";
foreach (<TEMP>) {
    $seen{$_}++;
}
close(TEMP) || die "cannot close file $tempfile: $!";

My program keeps aborting after a few minutes because the computer runs out
of memory. I have four gigs of RAM and the total paging file size is 10 megs,
but Perl does not appear to be using it.

How can I find the frequency of each line using such a large dataset? I
tried to have two output files where I kept moving the data back and forth
each time I grabbed the next line from TEMP instead of using $seen{$_}++,
but I did not have much success.

Gunnar Hjalmarsson

Apr 30, 2008, 10:24:51 PM

Cosmic Cruizer wrote:
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile:
> $!";
> foreach (<TEMP>) {
> $seen{$_}++;
> }
> close(TEMP) || die "cannot close file
> $tempfile: $!";
>
> My program keeps aborting after a few minutes because the computer runs out
> of memory.

This line:

> foreach (<TEMP>) {

reads the whole file into memory. You should read the file line by line
instead by replacing it with:

while (<TEMP>) {
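
To spell out the difference, here is a minimal sketch. The sub name count_lines and the sample data are made up for illustration. foreach evaluates <TEMP> in list context, so every line is read before the loop body runs; while evaluates it in scalar context and fetches one line per iteration.

```perl
use strict;
use warnings;

# Count how often each line occurs, reading one line at a time.
# while (<$fh>) calls readline in scalar context, so only the current
# line is held in memory (plus the hash of counts).
sub count_lines {
    my ($fh) = @_;
    my %seen;
    while ( my $line = <$fh> ) {
        $seen{$line}++;
    }
    return \%seen;
}

# Demonstration on an in-memory filehandle (sample data is made up).
my $sample = "a\nb\na\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $counts = count_lines($fh);
close $fh or die "Cannot close in-memory file: $!";

for my $line ( sort keys %$counts ) {
    chomp( my $text = $line );
    print "$counts->{$line}\t$text\n";
}
```

The hash still holds one key per unique line, so this only helps when the number of distinct lines fits in memory.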

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

A. Sinan Unur

Apr 30, 2008, 10:31:40 PM

Cosmic Cruizer <XXjbh...@white-star.com> wrote in
news:Xns9A90C3D86EFCE...@207.115.17.102:

> I've been able to reduce my dataset by 75%, but it still leaves me
> with a file of 47 gigs. I'm trying to find the frequency of each line
> using:
>
> open(TEMP, "< $tempfile") || die "cannot open file
> $tempfile:
> $!";
> foreach (<TEMP>) {

Well, that is simply silly. You have a huge file yet you try to read all
of it into memory. Ain't gonna work.

How long is each line and how many unique lines do you expect?

If the number of unique lines is small relative to the number of total
lines, I do not see any difficulty if you get rid of the boneheaded for
loop.

> $seen{$_}++;
> }
> close(TEMP) || die "cannot close file
> $tempfile: $!";

my %seen;

open my $TEMP, '<', $tempfile
    or die "Cannot open '$tempfile': $!";

++$seen{$_} while <$TEMP>;

close $TEMP
    or die "Cannot close '$tempfile': $!";

> My program keeps aborting after a few minutes because the computer
> runs out of memory. I have four gigs of ram and the total paging files
> is 10 megs, but Perl does not appear to be using it.

I don't see much point to having a 10 MB swap file. To make the best use
of 4 GB physical memory, AFAIK, you need to be running a 64 bit OS.

> How can I find the frequency of each line using such a large dataset?
> I tried to have two output files where I kept moving the databack and
> forth each time I grabbed the next line from TEMP instead of using
> $seen{$_}++, but I did not have much success.

If the number of unique lines is large, I would periodically store the
current counts, clear the hash, keep processing the original file. Then,
when you reach the end of the original data file, go back to the stored
counts (which will have multiple entries for each unique line) and
aggregate the information there.
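
One way the store-and-aggregate idea could look is sketched below. This is only an illustration under stated assumptions: the sub names, the key limit, and the "count<TAB>line" file format are all invented here. Each batch is written out sorted, so the final aggregation can merge the partial files sequentially instead of loading them into one big hash again.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use List::Util qw(minstr);

# Write the counts to $path as "count<TAB>line", sorted by line.
sub flush_counts {
    my ( $seen, $path ) = @_;
    open my $out, '>', $path or die "Cannot open '$path': $!";
    print {$out} "$seen->{$_}\t$_\n" for sort keys %$seen;
    close $out or die "Cannot close '$path': $!";
}

# Read one "count<TAB>line" record; returns an empty list at end of file.
sub next_record {
    my ($fh) = @_;
    defined( my $rec = <$fh> ) or return;
    chomp $rec;
    return split /\t/, $rec, 2;
}

# Count lines, flushing the hash to a new partial file whenever it
# holds $max_keys distinct lines. Returns the partial file names.
sub count_in_batches {
    my ( $in_fh, $max_keys, $dir ) = @_;
    my ( %seen, @parts );
    my $flush = sub {
        my $path = sprintf '%s/part%05d.txt', $dir, scalar @parts;
        flush_counts( \%seen, $path );
        push @parts, $path;
        %seen = ();
    };
    while ( my $line = <$in_fh> ) {
        chomp $line;
        $seen{$line}++;
        $flush->() if scalar( keys %seen ) >= $max_keys;
    }
    $flush->() if %seen;
    return @parts;
}

# Merge the sorted partial files, summing counts for identical lines.
# Only one pending record per partial file is held in memory.
sub merge_counts {
    my ( $out_fh, @parts ) = @_;
    my @heads;    # each entry: [ filehandle, count, line ]
    for my $path (@parts) {
        open my $fh, '<', $path or die "Cannot open '$path': $!";
        my @rec = next_record($fh);
        push @heads, [ $fh, @rec ] if @rec;
    }
    while (@heads) {
        my $min   = minstr map { $_->[2] } @heads;   # smallest pending line
        my $total = 0;
        for my $h (@heads) {
            next unless $h->[2] eq $min;
            $total += $h->[1];
            my @rec = next_record( $h->[0] );
            if (@rec) { @{$h}[ 1, 2 ] = @rec }
            else      { $h->[1] = undef }            # this file is exhausted
        }
        @heads = grep { defined $_->[1] } @heads;
        print {$out_fh} "$total\t$min\n";
    }
}

# Demonstration with made-up data and a deliberately tiny key limit.
my $data = "b\na\nb\nc\na\nb\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my $dir   = tempdir( CLEANUP => 1 );
my @parts = count_in_batches( $in, 2, $dir );
close $in or die "Cannot close in-memory file: $!";

my $result = '';
open my $out, '>', \$result or die "Cannot open in-memory file: $!";
merge_counts( $out, @parts );
close $out or die "Cannot close in-memory file: $!";
print $result;
```

For a real run the key limit would be chosen so that one batch fits comfortably in RAM. The merge assumes the partial files are few enough to keep open at the same time.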

Sinan

--
A. Sinan Unur <1u...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

xho...@gmail.com

Apr 30, 2008, 10:38:25 PM

Cosmic Cruizer <XXjbh...@white-star.com> wrote:
> I've been able to reduce my dataset by 75%, but it still leaves me with a
> file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file
> $tempfile: $!";
> foreach (<TEMP>) {
> $seen{$_}++;
> }
> close(TEMP) || die "cannot close file
> $tempfile: $!";

If each line shows up a million times on average, that shouldn't
be a problem. If each line shows up twice on average, then it won't
work so well with 4G of RAM. We don't know which of those is closer to
your case.

> My program keeps aborting after a few minutes because the computer runs
> out of memory. I have four gigs of ram and the total paging files is 10
> megs, but Perl does not appear to be using it.

If the program is killed due to running out of memory, then I would
say that the program *does* appear to be using the available memory. What
makes you think it isn't using it?

> How can I find the frequency of each line using such a large dataset?

I probably wouldn't use Perl, but rather the OS's utilities. For example,
on Linux:

sort big_file | uniq -c
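
For comparison, the counting half of that pipeline can be sketched in Perl. Once the input is sorted, equal lines are adjacent, so a single pass that remembers only the previous line and a running count is enough. The sub name count_sorted and the sample data are made up for illustration; the sorting itself is assumed to have been done already, for example by sort.

```perl
use strict;
use warnings;

# Count runs of identical adjacent lines, much as `uniq -c` does.
# The input must already be sorted; only the previous line and its
# running count are kept in memory.
sub count_sorted {
    my ( $in, $out ) = @_;
    my ( $prev, $count );
    while ( my $line = <$in> ) {
        chomp $line;
        if ( defined $prev && $line eq $prev ) {
            $count++;
            next;
        }
        print {$out} "$count\t$prev\n" if defined $prev;
        ( $prev, $count ) = ( $line, 1 );
    }
    print {$out} "$count\t$prev\n" if defined $prev;
}

# Demonstration on in-memory filehandles (sample data is made up).
# To use the sub as a filter, pass \*STDIN and \*STDOUT instead.
my $sorted = "a\na\nb\nc\nc\nc\n";
my $result = '';
open my $in,  '<', \$sorted or die "Cannot open in-memory file: $!";
open my $out, '>', \$result or die "Cannot open in-memory file: $!";
count_sorted( $in, $out );
close $in  or die "Cannot close in-memory file: $!";
close $out or die "Cannot close in-memory file: $!";
print $result;
```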

> I
> tried to have two output files where I kept moving the databack and forth
> each time I grabbed the next line from TEMP instead of using $seen{$_}++,
> but I did not have much success.

But in line 42.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

xho...@gmail.com

Apr 30, 2008, 10:39:23 PM

Gunnar Hjalmarsson <nor...@gunnar.cc> wrote:
> Cosmic Cruizer wrote:
> > I've been able to reduce my dataset by 75%, but it still leaves me with
> > a file of 47 gigs. I'm trying to find the frequency of each line using:
> >
> > open(TEMP, "< $tempfile") || die "cannot open file
> > $tempfile: $!";
> > foreach (<TEMP>) {
> > $seen{$_}++;
> > }
> > close(TEMP) || die "cannot close file
> > $tempfile: $!";
> >
> > My program keeps aborting after a few minutes because the computer runs
> > out of memory.
>
> This line:
>
> > foreach (<TEMP>) {
>
> reads the whole file into memory. You should read the file line by line
> instead by replacing it with:
>
> while (<TEMP>) {

Duh, I completely overlooked that.

Cosmic Cruizer

Apr 30, 2008, 11:32:45 PM

Gunnar Hjalmarsson <nor...@gunnar.cc> wrote in
news:67so01F...@mid.individual.net:

> Cosmic Cruizer wrote:
>> I've been able to reduce my dataset by 75%, but it still leaves me
>> with a file of 47 gigs. I'm trying to find the frequency of each line
>> using:
>>
>> open(TEMP, "< $tempfile") || die "cannot open file
>> $tempfile:
>> $!";
>> foreach (<TEMP>) {
>> $seen{$_}++;
>> }
>> close(TEMP) || die "cannot close file
>> $tempfile: $!";
>>
>> My program keeps aborting after a few minutes because the computer
>> runs out of memory.
>
> This line:
>
>> foreach (<TEMP>) {
>
> reads the whole file into memory. You should read the file line by
> line instead by replacing it with:
>
> while (<TEMP>) {
>

<sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
else I used the while statement to get me to this point. This solves the
problem.

Thank you.

Jürgen Exner

Apr 30, 2008, 11:44:30 PM

Cosmic Cruizer <XXjbh...@white-star.com> wrote:
>I've been able to reduce my dataset by 75%, but it still leaves me with a
>file of 47 gigs. I'm trying to find the frequency of each line using:
>
> open(TEMP, "< $tempfile") || die "cannot open file $tempfile:
>$!";
> foreach (<TEMP>) {

This slurps the whole file (yes, all 47GB) into a list and then iterates
over that list. Read the file line-by-line instead:

while (<TEMP>){

This should work unless you have a lot of different data points.

jue

Ben Bullock

May 1, 2008, 1:06:04 AM

A. Sinan Unur <1u...@llenroc.ude.invalid> wrote:
> Cosmic Cruizer <XXjbh...@white-star.com> wrote in
> news:Xns9A90C3D86EFCE...@207.115.17.102:
>
>> I've been able to reduce my dataset by 75%, but it still leaves me
>> with a file of 47 gigs. I'm trying to find the frequency of each line
>> using:
>>
>> open(TEMP, "< $tempfile") || die "cannot open file
>> $tempfile:
>> $!";
>> foreach (<TEMP>) {
>
> Well, that is simply silly. You have a huge file yet you try to read all
> of it into memory. Ain't gonna work.

I'm not sure why it's silly as such - perhaps he didn't know that
"foreach" would read all the file into memory.

> If the number of unique lines is small relative to the number of total
> lines, I do not see any difficulty if you get rid of the boneheaded for
> loop.

Again, why is it "boneheaded"? The fact that foreach reads the entire
file into memory isn't something I'd expect people to know
automatically.

A. Sinan Unur

May 1, 2008, 7:26:40 AM

benkasmi...@gmail.com (Ben Bullock) wrote in
news:fvbj3s$l7u$1...@ml.accsnet.ne.jp:

> A. Sinan Unur <1u...@llenroc.ude.invalid> wrote:
>> Cosmic Cruizer <XXjbh...@white-star.com> wrote in
>> news:Xns9A90C3D86EFCE...@207.115.17.102:
>>

...

>>> foreach (<TEMP>) {
>>
>> Well, that is simply silly. You have a huge file yet you try to read
>> all of it into memory. Ain't gonna work.
>
> I'm not sure why it's silly as such - perhaps he didn't know that
> "foreach" would read all the file into memory.

Well, I assumed he didn't. But this is one of those things, had I found
myself doing it, after spending hours and hours trying to work out a way
of processing the file, I would have slapped my forehead and said, "now
that was just a silly thing to do". Coupled with the "ain't" I assumed
my meaning was clear. I wasn't calling the OP names, but trying to get a
message across very strongly.

>> If the number of unique lines is small relative to the number of
>> total lines, I do not see any difficulty if you get rid of the
>> boneheaded for loop.
>
> Again, why is it "boneheaded"?

Because there is no hope of anything working so long as that for loop is
there.

> The fact that foreach reads the entire file into memory isn't
> something I'd expect people to know automatically.

Maybe this helps:

From perlfaq3.pod:

<blockquote>
* How can I make my Perl program take less memory?

...

Of course, the best way to save memory is to not do anything to waste it
in the first place. Good programming practices can go a long way toward
this:

* Don't slurp!

Don't read an entire file into memory if you can process it line by
line. Or more concretely, use a loop like this:
</blockquote>

Maybe you would like to read the rest.

So, calling the for loop boneheaded is a little stronger than "Bad
Idea", but then what is simply a bad idea with a 200 MB file (things
will still work but less efficiently) is boneheaded with a 47 GB file
(there is no chance of the program working).

There is a reason "Don't slurp!" appears with an exclamation mark and as
the first recommendation in the FAQ list answer.

Hope this helps you become more comfortable with the notion that reading
a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
Wall does it, if Superman does it ... you get the picture I hope.

nolo contendere

May 1, 2008, 11:54:14 AM

On May 1, 7:26 am, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
> benkasminbull...@gmail.com (Ben Bullock) wrote in news:fvbj3s$l7u$1...@ml.accsnet.ne.jp:

>
> > A. Sinan Unur <1...@llenroc.ude.invalid> wrote:
>
> Hope this helps you become more comfortable with the notion that reading
> a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
> Wall does it, if Superman does it ... you get the picture I hope.
>

I don't think it would be boneheaded if Superman did it...I mean, he's
SUPERMAN.

A. Sinan Unur

May 1, 2008, 11:57:45 AM

nolo contendere <simon...@fmr.com> wrote in
news:1aa7f96f-7458-4d3a...@b64g2000hsa.googlegroups.com:

But attempting to slurp a 47 GB file is the equivalent of having a
cryptonite slurpee in the morning.

Not good.

;-)

Chris Mattern

May 1, 2008, 12:40:54 PM

On 2008-05-01, Gunnar Hjalmarsson <nor...@gunnar.cc> wrote:
> Cosmic Cruizer wrote:
>> I've been able to reduce my dataset by 75%, but it still leaves me with a
>> file of 47 gigs. I'm trying to find the frequency of each line using:
>>
>> open(TEMP, "< $tempfile") || die "cannot open file $tempfile:
>> $!";
>> foreach (<TEMP>) {
>> $seen{$_}++;
>> }
>> close(TEMP) || die "cannot close file
>> $tempfile: $!";
>>
>> My program keeps aborting after a few minutes because the computer runs out
>> of memory.
>
> This line:
>
>> foreach (<TEMP>) {
>
> reads the whole file into memory. You should read the file line by line
> instead by replacing it with:
>
> while (<TEMP>) {
>

Which still leaves him with a hash that keeps each unique line in the file
as a separate key. Betcha it doesn't fit. Basic UNIX utilities can do
this, though I will admit I can't guarantee that sort can handle something
this big:

sort tempfile | uniq -c

--
Christopher Mattern

NOTICE
Thank you for noticing this new notice
Your noticing it has been noted
And will be reported to the authorities

Chris Mattern

May 1, 2008, 12:42:16 PM

Didn't realize your file had so many duplicates (and thus such a small
set of unique lines). If it works, that's great!

Chris Mattern

May 1, 2008, 12:43:22 PM

Hey, Superman can do boneheaded things. It's not like he's Chuck Norris.

Uri Guttman

May 1, 2008, 2:38:49 PM

>>>>> "ASU" == A Sinan Unur <1u...@llenroc.ude.invalid> writes:

ASU> nolo contendere <simon...@fmr.com> wrote in
ASU> news:1aa7f96f-7458-4d3a...@b64g2000hsa.googlegroups.com:

>> On May 1, 7:26 am, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>>> benkasminbull...@gmail.com (Ben Bullock) wrote in
>>> news:fvbj3s$l7u$1...@ml.accsnet.ne.jp:
>>>
>>> > A. Sinan Unur <1...@llenroc.ude.invalid> wrote:
>>>
>>> Hope this helps you become more comfortable with the notion that
>>> reading a 47 GB file is a boneheaded move. It is boneheaded if I do
>>> it, if Larry Wall does it, if Superman does it ... you get the
>>> picture I hope.
>>>
>>
>> I don't think it would be boneheaded if Superman did it...I mean, he's
>> SUPERMAN.

ASU> But attempting to slurp a 47 GB files is the equivalent of having a
ASU> cryptonite slurpee in the morning.

ASU> Not good.

ASU> ;-)

and i wouldn't even recommend File::Slurp for that job!! :)

uri

--
Uri Guttman ------ u...@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Free Perl Training --- http://perlhunter.com/college.html ---------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

John W. Krahn

May 1, 2008, 2:43:20 PM

A. Sinan Unur wrote:
> nolo contendere <simon...@fmr.com> wrote in
> news:1aa7f96f-7458-4d3a...@b64g2000hsa.googlegroups.com:
>
>> On May 1, 7:26 am, "A. Sinan Unur" <1...@llenroc.ude.invalid> wrote:
>>> benkasminbull...@gmail.com (Ben Bullock) wrote in
>>> news:fvbj3s$l7u$1...@ml.accsnet.ne.jp:
>>>> A. Sinan Unur <1...@llenroc.ude.invalid> wrote:
>>> Hope this helps you become more comfortable with the notion that
>>> reading a 47 GB file is a boneheaded move. It is boneheaded if I do
>>> it, if Larry Wall does it, if Superman does it ... you get the
>>> picture I hope.
>>>
>> I don't think it would be boneheaded if Superman did it...I mean, he's
>> SUPERMAN.
>
> But attempting to slurp a 47 GB files is the equivalent of having a
> cryptonite slurpee in the morning.

s/cryptonite/kryptonite/;

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Uri Guttman

May 1, 2008, 2:44:45 PM

>>>>> "JWK" == John W Krahn <som...@example.com> writes:

JWK> A. Sinan Unur wrote:
>>> I don't think it would be boneheaded if Superman did it...I mean, he's
>>> SUPERMAN.
>> But attempting to slurp a 47 GB files is the equivalent of having a
>> cryptonite slurpee in the morning.

JWK> s/cryptonite/kryptonite/;

what if the 47 GB was enkrypted? (sic :)

Cosmic Cruizer

May 1, 2008, 11:26:37 PM

Cosmic Cruizer <XXjbh...@white-star.com> wrote in

news:Xns9A90D0E1FD16c...@207.115.17.102:

Well... that did not make any difference at all. I still get up to about
90% of the physical RAM and the job aborts within about the same
timeframe. From what I can tell, using while made no difference compared
with using foreach. I tried using the two swapfiles idea, but that is not
a viable solution. I guess the only thing to do is to break the files
down into smaller chunks of about 5 gigs each. That will give me about 3
to 4 days' worth of data at a time. After that, I can look at what I have
and decide how I can optimize the data for the next run.

Sherman Pendley

May 2, 2008, 12:42:59 AM

nolo contendere <simon...@fmr.com> writes:

It's Superman (an individual, the guy in tights) or superman (generic, not a
proper name, as in "a superman"), but never SUPERMAN. It's not an acronym,
folklore notwithstanding.

Newbs...

sherm--

--
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net

comp.llang.perl.moderated

May 2, 2008, 3:26:48 PM

On May 1, 8:26 pm, Cosmic Cruizer <XXjbhun...@white-star.com> wrote:
> Cosmic Cruizer <XXjbhun...@white-star.com> wrote in news:Xns9A90D0E1FD16c...@207.115.17.102:
>
> > Gunnar Hjalmarsson <nore...@gunnar.cc> wrote in

While slower, you could tie %seen to a DBM file if the hash is
outgrowing memory, e.g.:

use DB_File;
use Fcntl;
my $db = tie %seen, 'DB_File', $filename, O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot tie '$filename': $!";
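
A fuller sketch of the tied-hash approach follows, assuming DB_File (and the Berkeley DB library behind it) is installed. The database path and the sample data are hypothetical. The counting loop is unchanged; only the storage behind %seen moves from RAM to disk.

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);

# Hypothetical location for the on-disk hash; removed again on exit.
my $dir    = tempdir( CLEANUP => 1 );
my $dbfile = "$dir/seen.db";

# Every fetch and store on %seen now goes to the DBM file.
tie my %seen, 'DB_File', $dbfile, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie '$dbfile': $!";

# Made-up sample input on an in-memory filehandle.
my $sample = "x\ny\nx\nx\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    $seen{$line}++;
}
close $fh or die "Cannot close in-memory file: $!";

# Sorting the keys is fine for this tiny demo. For a large database,
# iterate with each() so the key list is not built in memory.
my %copy = map { $_ => $seen{$_} } sort keys %seen;
print "$copy{$_}\t$_\n" for sort keys %copy;

untie %seen;
```

Expect this to run much slower than an in-memory hash, since each increment becomes a database read and write.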

--
Charles DeRykus