>>>>> "jeanbarloy" == jeanbarloy <jeanb...@gmail.com> writes:
jeanbarloy> On Monday, 25 June 2012 20:07:31 UTC+2, Ed Morton wrote:
>> On 6/25/2012 12:55 PM, Jean wrote:
>> > Greetings unix gurus !
>> >
>> > I've got several ebooks in .txt format (in which words are separated
>> > by one or several spaces) and I am trying to produce linguistic
>> > statistics out of them.
>> > I'd like to know if there is a simple way with grep - or any other
>> > utility - to print out the most frequent associations of 3 words that
>> > appear within my textual corpus (by "most frequent", I mean: which
>> > appear at least 3 times throughout the whole corpus).
>> >
>> > Thank you very much !
>>
>> awk is the tool you want but please post some sample input and expected output
>> as the statements above could be interpreted in various different ways.
>>
>> Ed.
jeanbarloy> Thank you for answering me Ed.
jeanbarloy> sample input:
jeanbarloy> Once upon a time, there was a princess who had no luck. She had no luck and nobody knew why. There was a prince, though, who...
jeanbarloy> The phrase "once upon a time" is commonly used in fairy tales [...] There was a man who could do it, but there was a big problem.
jeanbarloy> It used to be like this, once upon a time.
jeanbarloy> (100MB of raw text like this).
jeanbarloy> expected output:
jeanbarloy> once upon a 3
jeanbarloy> upon a time 3
jeanbarloy> there was a 4
$ perl -e '
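  # Slurp all input, split it on non-word characters, and lowercase each word.
  @x = map lc, split /\W+/, join "", <>;
  # Slide a three-word window over the list, counting each trigram.
  while (@x >= 3) {
    $c{"@x[0,1,2]"}++;
    shift @x;
  }
  # Print every trigram seen at least 3 times, in sorted order.
  for (sort keys %c) {
    next if $c{$_} < 3;
    print "$_ $c{$_}\n";
  }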
' <<'YOURTEXT'
Once upon a time, there was a princess who had no luck. She had no luck and nobody knew why. There was a prince, though, who...
The phrase "once upon a time" is commonly used in fairy tales [...] There was a man who could do it, but there was a big problem.
It used to be like this, once upon a time.
YOURTEXT
once upon a 3
there was a 4
upon a time 3
$
Fits the bill. You're welcome.
print "Just another Perl hacker,"; # the original
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<mer...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
See http://methodsandmessages.posterous.com/ for Smalltalk discussion