
GREP: finding the most frequent associations of 3 words (any word) within a corpus


Jean

Jun 25, 2012, 1:55:53 PM
Greetings unix gurus !

I've got several ebooks in .txt format (in which words are separated
by one or several spaces) and I am trying to produce linguistic
statistics out of them.
I'd like to know if there is a simple way with grep - or any other
utility - to print out the most frequent associations of 3 words that
appear within my textual corpus (by "most frequent", I mean: which
appear at least 3 times throughout the whole corpus).

Thank you very much !


Ed Morton

Jun 25, 2012, 2:07:31 PM
awk is the tool you want but please post some sample input and expected output
as the statements above could be interpreted in various ways.

Ed.

jeanb...@gmail.com

Jun 25, 2012, 2:23:54 PM
Thank you for answering me, Ed.

sample input:

Once upon a time, there was a princess who had no luck. She had no luck and nobody knew why. There was a prince, though, who...
The phrase "once upon a time" is commonly used in fairy tales [...] There was a man who could do it, but there was a big problem.
It used to be like this, once upon a time.

(100MB of raw text like this).

expected output:
once upon a 3
upon a time 3
there was a 4


but exclude "had no luck", since it appears only twice.

I hope it makes sense.

Thank you

Janis Papanagnou

Jun 25, 2012, 2:43:44 PM
Something like

awk '{ gsub(/[^[:alpha:]]+/," "); printf(" %s",tolower($0)) }' your_file |
awk 'BEGIN { SUBSEP=" " }
     { for(i=1;i<=NF-2;i++) m[$i,$(i+1),$(i+2)]++ }
     END { for (k in m) if (m[k]>=3) print k, m[k] }'


Janis

jeanb...@gmail.com

Jun 25, 2012, 2:58:19 PM
Thanks Janis. I tried it, and there seems to be something wrong with the output:

viva viva viva 6
population population population 5
libra libra libra 3

Each line is made up of the same word repeated three times.
Also, I've got this error message: awk: (FILENAME=- FNR=1566445) fatal: sub_common: buf: can't allocate 0 bytes of memory (Success)

I'm afraid I don't know enough about Linux to see what could be causing the problem.

Ed Morton

Jun 25, 2012, 3:07:41 PM
With GNU awk to allow for setting RS to an RE instead of just one character:

$ cat tst.awk
BEGIN { RS="[^[:alnum:]]+" }

{ w1 = w2; w2 = w3; w3 = $0 }

NR > 2 { count[tolower(w1 " " w2 " " w3)]++ }

END {
    for (phrase in count) {
        if (count[phrase] >= 3) {
            print phrase, count[phrase]
        }
    }
}

$ gawk -f tst.awk file
there was a 4
upon a time 3
once upon a 3

The above will fail if, for example, "isn't" should be considered a "word".
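
A possible tweak for that, sketched here on the assumption that embedded apostrophes should count as part of a word, is to leave the apostrophe out of the record separator:

BEGIN { RS="[^[:alnum:]']+" }

That would keep "isn't" together as one record, at the cost of also gluing stray quote characters onto words.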

Regards,

Ed.

Janis Papanagnou

Jun 25, 2012, 3:20:16 PM
I don't know anything about your environment, what awk you are using,
and what exactly you've tried. When I run my awk command sequence with
your sample data I get:

there was a 4
upon a time 3
once upon a 3


Janis

jeanb...@gmail.com

Jun 25, 2012, 3:30:11 PM
Thank you very much Ed, it's working flawlessly. And it's amazingly fast too ! I didn't quite expect that ! Thank you again for taking the time to help me out.

jeanb...@gmail.com

Jun 26, 2012, 2:01:13 AM
On Monday, June 25, 2012 at 8:43:44 PM UTC+2, Janis Papanagnou wrote:
It's working as well Janis. There was a problem with my syntax. Thank you very much !

The problem I'm having right now is to make it work with large files (> 1GB). Gawk complains it runs out of memory after a while (fatal: more_nodes: nextfree: can't allocate 6400 bytes of memory (Cannot allocate memory)). I'll try to fix that and post the results.

Janis Papanagnou

Jun 26, 2012, 3:44:07 AM
On 26.06.2012 08:01, jeanb...@gmail.com wrote:
>
> It's working as well Janis. There was a problem with my syntax. Thank you
> very much !

You're welcome.

>
> The problem I'm having right now is to make it work with large files ( >
> 1GB).

(Oh, I thought you had said your data files were on the order of 100k.)

> Gawk complains it runs out of memory after a while (fatal:
> more_nodes: nextfree: can't allocate 6400 bytes of memory (Cannot allocate
> memory).

That is good to know!

Gawk typically has sufficiently high limits, and I wonder why that system
memory limit is triggered here, since Ed's variant should require the same
amount of memory; it uses exactly the same storage pattern that I proposed.
The only (but maybe crucial) difference is that my variant works on one
huge line (the words are conceptually FS separated) while Ed's variant reads
the words one per record (RS separated), which may be more robust if awk's
input line length is limited. (The drawback of Ed's variant is that it's
GNU awk specific.)
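
One quick thing to check (assuming a Linux box with a bash-like shell) is whether a per-process limit, rather than physical RAM, is what's being hit, e.g.:

ulimit -a    # per-process limits (data seg size, max memory size, virtual memory)
free -m      # total RAM and swap actually available on the machine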

> I'll try to fix that and post the results.

Let us know about all insights you get.

Janis

Janis Papanagnou

Jun 26, 2012, 3:45:24 AM
On 26.06.2012 09:44, Janis Papanagnou wrote:
> On 26.06.2012 08:01, jeanb...@gmail.com wrote:
>> The problem I'm having right now is to make it work with large files ( >
>> 1GB).
> (Oh, I thought you had said your data files were on the order of 100k.)

I meant to write "100M".

> Janis

jeanb...@gmail.com

Jun 26, 2012, 4:26:29 AM
Initially, yes, I wanted to run the script against files < 1GB. But then I thought: why not think bigger than that?
One solution I was given would be to use Python instead and "store the keys in a database, so I don't need to keep them in memory". I'm going to look deeper into that and will let you know.
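
For what it's worth, a rough sketch of that "keys in a database" idea, done with the sqlite3 command-line shell rather than Python (the file and table names here are made up for illustration), could look like:

$ awk -f tst.awk file > trigrams.txt        # one 3-word group per line, as in Ed's script
$ sqlite3 trigrams.db
sqlite> CREATE TABLE t(phrase TEXT);
sqlite> .import trigrams.txt t
sqlite> SELECT phrase, COUNT(*) FROM t GROUP BY phrase HAVING COUNT(*) >= 3;

The aggregation then happens inside SQLite rather than in an in-memory awk array.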

Ed Morton

Jun 26, 2012, 9:16:19 AM
You could instead use awk to generate the groups of 3 words and then pipe the
output to sort, then to uniq -c, and then to awk again. Something like this:

$ cat tst.awk
BEGIN { RS="[^[:alnum:]]+" }
{ w1 = w2; w2 = w3; w3 = tolower($0) }
NR > 2 { print w1, w2, w3 }

$ awk -f tst.awk file | sort | uniq -c | awk '$1>2'
3 once upon a
4 there was a
3 upon a time

Regards,

Ed.

Barry Margolin

Jun 26, 2012, 10:12:39 AM
In article <jsccn2$ip6$1...@dont-email.me>,
If the original file is 1GB, you're going to sort about 3GB. You think
that will be better than putting the words in a DB?

--
Barry Margolin, bar...@alum.mit.edu
Arlington, MA
*** PLEASE post questions in newsgroups, not directly to me ***

jeanb...@gmail.com

Jun 26, 2012, 9:25:20 AM
I'm going to give it a try, Ed, thanks. Somebody had mentioned this on Stack Overflow but without giving many details. I will let you know. You may be my saviour :)

Ed Morton

Jun 26, 2012, 11:11:55 AM
"better" is relative. Personally I wouldn't consider a DB for something like
this until I'd tried a few simpler options to see if they'd be adequate.

Ed.

Ed Morton

Jun 26, 2012, 12:48:07 PM
You actually don't NEED the uniq -c piped to awk for that functionality; you
could just do the post-sort part in awk again as:

$ awk -f tst.awk file | sort |
awk '
$0 != p { if (c>2) print p, c; c=0 }
{p=$0; c++}
END { if (c>2) print p, c }
'

but the uniq -c and pipe might turn out to be faster.

Faster still might be adding another sort and then exiting from the final awk
when the count drops below 3, e.g.:

$ awk -f tst.awk file | sort | uniq -dc | sort -rn | awk '$1<3{exit}1'
4 there was a
3 upon a time
3 once upon a

I also threw in the "-d" flag to uniq to drop lines that only appear once just
to give whatever tool follows it less input to process.

I'd be interested in seeing the result of running "time" on some or all of the
above options for your real input if you wouldn't mind posting those results.
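
For instance, in a bash-like shell (where time placed in front of a pipeline measures the pipeline as a whole), that could be as simple as the following, out3.txt being an arbitrary output file name:

$ time awk -f tst.awk file | sort | uniq -dc | sort -rn | awk '$1<3{exit}1' > out3.txt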

Regards,

Ed.

Posted using www.webuse.net

jeanb...@gmail.com

Jun 26, 2012, 4:25:11 PM
Ed: so far so good, it seems to be working this time. I'm using the first solution you posted with a 30GB file as the input. This is a bit crazy but I've got a big hard drive and time on my hands. I think I will need at least 300GB free (3 words, thus input file x3 + each word can be in three different positions, resulting in three different combinations: input file x3). Am I correct?

The process was started exactly 6 hours ago. Total sort temp files at this time: 70GB.

I will post the results tomorrow night. I hope the system won't crash by then. But I'm using Linux, so that should be okay ;)

Thanks again for your help, Ed, and thank you Janis.

I'll try the other solution later, to print out two-word combinations this time.
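
A sketch of such a two-word variant, reusing Ed's RS-based word splitting but with the sliding window shortened to two words (the name tst2.awk is arbitrary), would be:

$ cat tst2.awk
BEGIN { RS="[^[:alnum:]]+" }
{ w1 = w2; w2 = tolower($0) }
NR > 1 { print w1, w2 }

$ awk -f tst2.awk file | sort | uniq -c | awk '$1>2'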

Randal L. Schwartz

Jun 26, 2012, 4:58:21 PM
>>>>> "jeanbarloy" == jeanbarloy <jeanb...@gmail.com> writes:

jeanbarloy> Le lundi 25 juin 2012 20:07:31 UTC+2, Ed Morton a écrit :
>> On 6/25/2012 12:55 PM, Jean wrote:
>> > Greetings unix gurus !
>> >
>> > I've got several ebooks in .txt format (in which words are separated
>> > by one or several spaces) and I am trying to produce linguistic
>> > statistics out of them.
>> > I'd like to know if there is a simple way with grep - or any other
>> > utility - to print out the most frequent associations of 3 words that
>> > appear within my textual corpus (by "most frequent", I mean: which
>> > appear at least 3 times throughout the whole corpus).
>> >
>> > Thank you very much !
>>
>> awk is the tool you want but please post some sample input and expected output
>> as the statements above could be interpreted in various different ways.
>>
>> Ed.

jeanbarloy> Thank you for answering me Ed.

jeanbarloy> sample input:

jeanbarloy> Once upon a time, there was a princess who had no luck. She had no luck and nobody knew why. There was a prince, though, who...
jeanbarloy> The phrase "once upon a time" is commonly used in fairy tales [...] There was a man who could do it, but there was a big problem.
jeanbarloy> It used to be like this, once upon a time.

jeanbarloy> (100MB of raw text like this).

jeanbarloy> expected ouput:
jeanbarloy> once upon a 3
jeanbarloy> upon a time 3
jeanbarloy> there was a 4

$ perl -e '
@x = map lc, split /\W+/, join "", <>;
while (@x >= 3) {
$c{"@x[0,1,2]"}++;
shift @x;
}
for (sort keys %c) {
next if $c{$_} < 3;
print "$_ $c{$_}\n";
}
' <<'YOURTEXT'
Once upon a time, there was a princess who had no luck. She had no luck and nobody knew why. There was a prince, though, who...
The phrase "once upon a time" is commonly used in fairy tales [...] There was a man who could do it, but there was a big problem.
It used to be like this, once upon a time.
YOURTEXT
once upon a 3
there was a 4
upon a time 3
$

Fits the bill. You're welcome.

print "Just another Perl hacker,"; # the original

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.posterous.com/ for Smalltalk discussion

Ed Morton

Jun 26, 2012, 5:06:18 PM
jeanb...@gmail.com <jeanb...@gmail.com> wrote:

> Ed: so far so good, it seems to be working this time. I'm using the first
> solution you posted with a 30GB file as the input. This is a bit crazy but I've
> got a big hard drive and time on my hands. I think I will need at least 300GB
> free (3 words, thus input file x3 + each word can be in three different
> positions, resulting in three different combinations: input file x3). Am I correct?

I don't think so, as not every word appears in combination with every other word,
but maybe I don't understand your meaning. Anyway, I look forward to seeing the
results!

Ed.

Posted using www.webuse.net

Barry Margolin

Jun 26, 2012, 5:57:37 PM
In article <201206262...@webuse.net>,
If the input file contains:

Now is the time for all good men

the temp file will contain:

Now is the
is the time
the time for
time for all
for all good
all good men

So for each word in the file (except the last 2) there's a line, and
that line contains 3 words. So the temp file contains (n-2)*3 words.
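
For a 30GB input that means the stream handed to sort is on the order of 3 x 30GB = 90GB, since every word except the last two gets written out three times, before counting whatever temporary files sort itself creates while sorting it.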

John W. Krahn

Jun 26, 2012, 8:26:08 PM
jeanb...@gmail.com wrote:
>
> sample input:
>
> Once upon a time, there was a princess who had no luck. She had no luck and nobody knew why. There was a prince, though, who...
> The phrase "once upon a time" is commonly used in fairy tales [...] There was a man who could do it, but there was a big problem.
> It used to be like this, once upon a time.
>
> (100MB of raw text like this).
>
> expected output:
> once upon a 3
> upon a time 3
> there was a 4


$ echo "Once upon a time, there was a princess who had no luck. She had
no luck and nobody knew why. There was a prince, though, who...
The phrase "once upon a time" is commonly used in fairy tales [...]
There was a man who could do it, but there was a big problem.
It used to be like this, once upon a time.
" | perl -lne'
s/[[:punct:]]//g;
$data{"\L$1 $2 $3"}++ while /(\S+)(?=\s+(\S+)\s+(\S+))/g
}{
$data{$_} > 2 && print "$_ $data{$_}" for keys %data
'
upon a time 3
there was a 4
once upon a 3




John
--
Any intelligent fool can make things bigger and
more complex... It takes a touch of genius -
and a lot of courage to move in the opposite
direction. -- Albert Einstein

jeanb...@gmail.com

Jun 27, 2012, 7:41:41 AM
Hello Ed,

Here's the result: it took 13 hours to process a 28GB file. That's not too bad I think :)
I was wondering something: in NR > 2 { print w1, w2, w3 }, what does "NR > 2" mean?

Thank you,

Jean

Janis Papanagnou

Jun 27, 2012, 7:45:36 AM
Do that print action only if that condition is true. NR is the number of
the record currently being read, and since RS splits the input into one
word per record here, NR > 2 means "only from the third word onward".
This is necessary because you want to build triples; while the first two
words are being read there is not yet a complete triple available.
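
For example, tracing Ed's script on the sample text, the first few records (words) are handled like this:

NR=1: w3="once"                         (nothing printed yet)
NR=2: w2="once", w3="upon"              (nothing printed yet)
NR=3: w1="once", w2="upon", w3="a"  ->  prints "once upon a"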

Janis



jeanb...@gmail.com

Jun 27, 2012, 7:45:40 AM
to jwk...@shaw.ca
Thanks John, I'll give it a try and see if it can handle big files.

jeanb...@gmail.com

Jun 27, 2012, 7:44:45 AM
Thank you Randal, I'll try it out as soon as I can.

jeanb...@gmail.com

Jun 27, 2012, 8:16:05 AM
It makes sense, thank you very much !

Laurianne Gardeux

Jun 28, 2012, 5:09:03 PM
jeanbarloy:

> Here's the result: it took 13 hours to process a 28GB file. That's not
> too bad I think :)

And now, which are the most frequent associations of 3 words on
en.wikipedia? ;)

LG, just curious
