Perl and Factor versions of a module to supply random Latin text.

Terrence Brannon

Feb 22, 2010, 11:40:32 AM

to squatting...@googlegroups.com, po...@posterous.com

A while back, I saw the Data::Maker package on CPAN. I noticed that it made
use of the Text::Lorem package. Being unemployed, I tend to try to teach
myself things other than Perl.

After feeling the pain of trying to write Text::Lorem in Prolog, Clojure,
and Talend, I somehow remembered Factor. I spent hours and days of
teeth-pulling in those other technologies.

This was a single-day task in Factor, where most of the day was spent
following some tutorials and installing Factor.

The full source code for both versions can be viewed at:

http://gitorious.org/project_factor/lorem/blobs/master/lorem.factor
http://gitorious.org/text-lorem/text-lorem/blobs/master/lib/Text/Lorem.pm

What follows is a comparison of the Perl and Factor programs which generate
random text that looks like Latin.

> package Text::Lorem;
>
> use strict;
> use warnings;
> use vars qw($VERSION);
>
> $VERSION = "0.3";

Factor doesn't have a versioning system yet, or a CPAN. Basically the
'extra' directory in the distribution grows as people contribute.

>
> my $lorem_singleton;
>
> sub new {
> my $class = shift;
> $lorem_singleton ||= bless {},$class;
> return $lorem_singleton;
> }

The Perl class ensures that only one instance of Text::Lorem is ever
created. My Factor vocabulary (their term for module) is
module-oriented, not object-oriented, so there's no need for singleton
object creation.

Yes, Factor does have a CLOS-inspired object system, but that's a topic for
another time.
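
For anyone who has not met the `||=` singleton idiom before, here is a minimal sketch of what it buys you. The package name My::Lorem is a hypothetical stand-in, not the real Text::Lorem; only the body of new() is taken from the quoted code.

```perl
use strict;
use warnings;

package My::Lorem;   # hypothetical stand-in for Text::Lorem

my $lorem_singleton;

sub new {
    my $class = shift;
    # ||= assigns only while the variable is still false (here: undef)
    $lorem_singleton ||= bless {}, $class;
    return $lorem_singleton;
}

package main;

my $first  = My::Lorem->new;
my $second = My::Lorem->new;

# Both variables hold the same reference, so state set through one
# is visible through the other.
$first->{seen} = 1;
print "same object\n"  if $first == $second;
print "shared state\n" if $second->{seen};
```

A module-level vocabulary gets the same "only one of these" property for free, which is why the Factor version has nothing to correspond to new().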

>
> sub generate_wordlist {
> my $self = shift;
> [ map { s/\W//; lc($_) }split(/\s+/, <DATA>) ];
> }

: clean ( -- string ) text >lower R/ [^\sA-Za-z]/ "" re-replace ;

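Here the sample lorem ipsum text is split into words, cleaned, and
lowercased. In the Factor version, clean does the lowercasing and stripping;
the splitting happens in wordlist, shown next.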

>
> sub wordlist {
> my $self = shift;
> $self->{ wordlist } ||= $self->generate_wordlist();
> }

MEMO: wordlist ( -- array ) clean R/ \s+/ re-split [ >string ] map ;

This word takes the clean ipsum text, generates the wordlist, leaves
the array on the stack and memoizes it so that subsequent calls for
the wordlist execute in O(1) time instead of re-cleaning and splitting.

>
> sub wordcount {
> my $self = shift;
> return scalar(@{$self->{ wordlist }});
> }

Wordcount is a function designed to help random indexing into the
wordlist. The random vocabulary in Factor is similar to Python's in that
it has sampling functions so you don't have to do this sort of thing: you
just pass the sampling function the array and the number of elements you
want and you are done.

Notice that wordcount() dereferences the wordlist and asks for its length on
every call. Perl arrays keep track of their own length, so scalar(@array) is
cheap rather than an O(N) count, but the value never changes for a statically
computed list, so it only needs to be worked out once. Perhaps the author
should have hauled in Memoize so that both his wordlist() and wordcount()
functions could benefit from it instead of using his homegrown memoization.
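
A rough sketch of what hauling in Memoize could look like. This uses plain functions rather than methods, and the sample text and the $calls counter are my own additions for illustration; none of this is the module author's code.

```perl
use strict;
use warnings;
use Memoize;

my $calls = 0;   # counts how often the body of wordlist really runs

# Hypothetical stand-in for the module's DATA section.
my $text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit";

sub wordlist {
    $calls++;
    # lowercase each word, then strip every non-word character from it
    my @words = map { (my $w = lc $_) =~ s/\W//g; $w } split /\s+/, $text;
    return \@words;
}
memoize('wordlist');   # later calls return the cached array reference

sub wordcount { return scalar @{ wordlist() } }

wordlist() for 1 .. 3;
print "words: ", wordcount(), ", body ran $calls time(s)\n";
```

This is the closest Perl analogue to putting MEMO: in front of the Factor word: the function body stays the same and the caching is bolted on from outside.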

Either that or wordcount should have been set when constructing the
wordlist:

$self->wordcount (...)

So that it would only be accessed (not computed) later.
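
A minimal sketch of that second option, assuming a hypothetical My::Lorem class with a hard-coded word string in place of the DATA section. The count is recorded in the same place the list is built, so wordcount() only reads a slot.

```perl
use strict;
use warnings;

package My::Lorem;   # hypothetical; not the real Text::Lorem

sub new { return bless {}, shift }

sub wordlist {
    my $self = shift;
    return $self->{wordlist} if $self->{wordlist};
    # Hypothetical sample text standing in for the DATA section.
    my @words = split ' ', 'lorem ipsum dolor sit amet';
    $self->{wordcount} = @words;   # array in scalar context: its length, stored once
    return $self->{wordlist} = \@words;
}

sub wordcount {
    my $self = shift;
    $self->wordlist;               # make sure the list (and count) exist
    return $self->{wordcount};
}

package main;

my $lorem = My::Lorem->new;
print $lorem->wordcount, "\n";
```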

>
> sub get_word {
> my $self = shift;
> return $self->wordlist->[ int( rand( $self->wordcount ) ) ];
> }

: getword ( -- string ) wordlist random ;

random is used when you want 1 element of a list. For getting a variable
range of elements, you use the sample word, as we shall see.
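
Core Perl has no built-in sampling function, so to make the comparison concrete here is a sketch of both operations using List::Util. The names pick_one and sample are hypothetical helpers, and I am assuming that sample hands back distinct elements with no repeats; the Perl words() routine quoted further down can repeat a word, so the two are not exactly equivalent.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

my @wordlist = qw(lorem ipsum dolor sit amet);   # hypothetical sample data

# One random element, the same idea as get_word above.
sub pick_one {
    my ($list) = @_;
    return $list->[ int rand @$list ];
}

# $n elements from distinct random positions (no repeats).
# Assumes $n is no larger than the list.
sub sample {
    my ($list, $n) = @_;
    my @shuffled = shuffle @$list;
    return [ @shuffled[ 0 .. $n - 1 ] ];
}

print pick_one(\@wordlist), "\n";
print join(' ', @{ sample(\@wordlist, 3) }), "\n";
```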

Notice how Perl is constantly getting arguments and accessing
slots. Because Factor is stack-based, each word simply consumes what
it expects from the stack.

So, a lot of times, there is less argument fiddling. But you will find
yourself doing stack fiddling in Factor, such as when I call swap below to
switch the order of stack elements before calling sample.

>
> sub words {
> my $self = shift;
> my $num = shift;
> my @words;
> push @words, $self->get_word() for (1..$num);
> return join(' ', @words);
> }
>

: getwords ( n -- array ) wordlist swap sample ;

The Perl routine is named "words" yet it returns a string of words. I don't
think this is properly factored. I think this routine should've
returned an array of words. But I understand the author's intention. He
wanted to make a simple public API and most people are going to want a
string of words, not an array.

Anyway, it is supposed to get n random words. In Factor's random
vocabulary, the sample word expects an array and a number IN THAT
ORDER. Because the number is already on the stack, we place the
wordlist on the stack and then call swap so they are in the right
order for sample and we are done.
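
If the stack shuffling is hard to picture, here is the same sequence acted out with a plain Perl array as the stack. This is only a simulation for illustration, with made-up data; it is not how Factor is implemented.

```perl
use strict;
use warnings;

my @stack;

push @stack, 3;                            # caller leaves n on the stack:  ( 3 )
push @stack, [qw(lorem ipsum dolor sit)];  # "wordlist" pushes the array:   ( 3 array )
@stack[-2, -1] = @stack[-1, -2];           # "swap" exchanges the top two:  ( array 3 )

# A word that expects ( array n ) pops n first, then the array.
my $n     = pop @stack;
my $array = pop @stack;

print "n = $n, array has ", scalar(@$array), " words\n";
```

Without the swap, the two pops would hand the array to $n and the number to $array, which is exactly the mix-up the real swap prevents.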

You almost wish there were some sort of stack-fiddling shorthand so that
getwords could be written as:

: getwords ( n -- array ) wordlist sample(1,0) ;

Where 1,0 indicates that positions 0 and 1 on the stack need to be
switched. Just an idea.

Another idea is to overload the stack signature:

: getwords ( n -swap- array ) wordlist sample ;

Indicating that a swap should happen after wordlist is added to the stack.

> sub get_sentence {
> my $self = shift;
> my $words = $self->words( 4 + int( rand( 6 ) ) );
> ucfirst( $words );
> }
>

: sentence ( -- string ) sentencestring ucfirst "." append ;

This Factor code reads basically like English: you take a
sentencestring, ucfirst it, then append a period!

Factor encourages you to write small understandable words like
that. Here are the auxiliary words that sentence uses:

: sentencestring ( -- string ) sentencewords " " join ;
: sentencewords ( -- array ) 4 10 [a,b] random getwords ;
: ucfirst ( string -- string ) 1 cut [ >upper ] dip append ;

So while Perl had 1 subroutine, it was natural in Factor to decompose
it into 4 words.

Just another brief note. "[a,b]" above is nothing special. It's just a
word. As in Lisp, many characters not commonly encountered in identifiers are
acceptable as word characters in Factor.
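
For comparison, here is a sketch of how the same four-way decomposition might look back in Perl. All the sub names and the word data are hypothetical. I am assuming "4 10 [a,b]" means a range that includes both 4 and 10; note that the Perl original's 4 + int( rand( 6 ) ) tops out at 9, so the two versions differ slightly at the upper end.

```perl
use strict;
use warnings;

my @wordlist = qw(lorem ipsum dolor sit amet consectetur);  # hypothetical data

# Random integer from $lo to $hi, both ends included.
sub rand_between {
    my ($lo, $hi) = @_;
    return $lo + int rand($hi - $lo + 1);
}

# $n random words; repeats are possible, as in the Perl words() routine.
sub get_words       { my $n = shift; return [ map { $wordlist[ rand @wordlist ] } 1 .. $n ] }
sub sentence_words  { return get_words( rand_between(4, 10) ) }
sub sentence_string { return join ' ', @{ sentence_words() } }
sub sentence        { return ucfirst( sentence_string() ) . '.' }

print sentence(), "\n";
```

Each sub does one thing, which is the same shape the Factor words fall into naturally.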

> sub sentences {
> my $self = shift;
> my $num = shift;
> my @sentences;
> push @sentences, $self->get_sentence for (1..$num);
> join( '. ', @sentences ) . '.';
> }

: sentences ( n -- array ) [ sentence ] replicate ;

A sentence is a string with a ucfirst()ed first character and a
period. So, the Perl code is again not really factored correctly: a
routine to create sentences should return an array of sentences.

Perl people will see "[ sentence ]" and read it as some sort of array
construct, because Perl anonymous arrays are formed using square brackets.

However, that is not what is happening. What you see there is a
"quoted word", which Factor calls a quotation. Normally when Factor
encounters a word, it evaluates it immediately; the brackets defer
that. So here we are creating the equivalent of a Perl anonymous sub (a
"Perl closure") that we can call many times, collecting the output into
an array (that's what replicate does for us).
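
In Perl terms, the idea is roughly the following sketch. The replicate sub here is a hypothetical helper I wrote to mirror the Factor word; it is not part of any Perl module.

```perl
use strict;
use warnings;

# Call $code $n times and collect the results in an array reference.
sub replicate {
    my ($n, $code) = @_;
    return [ map { $code->() } 1 .. $n ];
}

my $counter  = 0;
my $numbered = replicate( 3, sub { return 'sentence ' . ++$counter } );

print join(', ', @$numbered), "\n";
```

The anonymous sub plays the part of "[ sentence ]": nothing runs until replicate decides to call it.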

But again, I also think that the plural nature of the subroutine
implies that an array of sentences should be returned, because an
array of sentences is much more tractable for later manipulation than
a joined string.

As the saying goes, once you make chocolate milk, it's hard to get the
chocolate or milk out later!

>
> sub get_paragraph {
> my $self = shift;
> my $sentences = $self->sentences(3 + int( rand( 4 ) ) );
>
> }

: paragraph ( -- string ) 3 7 [a,b] random sentences " " join ;

Wow, is this English or what: "give me between 3 and 7 random
sentences joined by spaces."

>
> sub paragraphs {
> my $self = shift;
> my $num = shift;
> my @paragraphs;
> push @paragraphs, $self->get_paragraph for (1..$num);
> join( "\n\n", @paragraphs );
> }