Markov Chain generator in Clojure

70 views
Skip to first unread message

Luke VanderHart

unread,
Dec 14, 2008, 3:42:13 PM12/14/08
to Clojure Study Group Washington DC
I have written a Markov Chain generator in Clojure. I'm really pleased
with how it turned out, given that it's the first non-trivial program
I've written in it. (If you don't know what a Markov Chain is, you can
read about it at http://en.wikipedia.org/wiki/Markov_chain). It
supports chaining based on letters or on words - both yield
interesting results. Just run it on any sufficiently large corpus
text, and you'll get output that mimics the style. I've run it on
Plato, Mark Twain and Jane Austen, and it's really interesting to
watch the different styles represented through pure randomness/
statistics.

The first iteration took me 4-5 hours, and then I took about that much
time again and re-wrote it to be fully lazy and decreased the size of
the code base by 50%, and far more readable and elegant. It performs
better, too.

I love the language - as compared with Java, it is so much fun to
write. It takes more effort for me to reason about, but all the code I
write is absolutely part of the program logic - I hadn't realized how
much boilerplate Java had until I thought about the Java
implementation of it.

I've uploaded it to the newsgroup here:
http://clojure-study-dc.googlegroups.com/web/markov.clj?gsc=qogOMBYAAADbtp3y7mUg4RbsV-ZUOeeBS7ibph5ftdNh9K_-frBgDg

Please, give it a shot if you have the time, and let me know what you
think. I'm particularly interested on comments on my coding style...
Am I using lisp idioms correctly?

Many thanks,
-Luke

Craig Andera

unread,
Dec 15, 2008, 10:34:00 AM12/15/08
to clojure-...@googlegroups.com
Inexpert observations:

* It probably makes sense to turn this into a lib with an ns form at
the top. My experience has been that you need to do that to be able to
debug, plus the niceness of use/require is convenient.
* Have you tested char-seq and word-seq against really large inputs?
The apparently recursive calls in there make me nervous: see [1].
* Have you looked at clojure.contrib.seq-utils? It seems like
partition-by might make your word-seq function obsolete.
* There have been some comments on the clojure list recently around
the difference between streams and seqs that might be relevant.

[1]
This function blows the heap when called via (count (down-from
20000000)). I would expect it to blow the *stack*, though: perhaps
count retains the head somehow. Also, I don't think I understand
exactly how lazy-cons works.

(defn down-from [x]
(if (> x 0)
(lazy-cons x (down-from (dec x)))))

Luke VanderHart

unread,
Dec 15, 2008, 3:35:14 PM12/15/08
to Clojure Study Group Washington DC
Thanks for the comments...

> * It probably makes sense to turn this into a lib with an ns form at
> the top. My experience has been that you need to do that to be able to
> debug, plus the niceness of use/require is convenient.

Will do.

> * Have you tested char-seq and word-seq against really large inputs?
> The apparently recursive calls in there make me nervous: see [1].

Yes... I've loaded entire 500-page novels, with index sizes up to half
a million entries. Also, the char-seq function is a slightly modified
clone of the line-seq function in the Clojure standard lib. I am not
100% sure I understand it completely, but to the best of my knowledge,
it works because lazy-cons is a macro. The nested call to char/word-
seq is not called recursively, rather, the call to it actually creates
a closure that is stored and then lazily called when required by the
seq. So it doesn't add to the stack. I would expect that a count of 20
million would blow the heap, unless you've assigned Java a lot of heap
space.

Takes only 10-15 seconds to build the index, too. I'm quite impressed
with Clojure's speed in that regard. And index lookups are fast,
despite using seqs as keys.

I wasn't aware of the seq-utils library... It does seem as if the
contrib library needs better documentation. I can't find anything
about what's available except what people happen to mention on the
newsgroup, short of reading the source itself.

Thanks for the comments! I appreciate it.

-Luke

On Dec 15, 10:34 am, "Craig Andera" <cand...@wangdera.com> wrote:
> Inexpert observations:
>
> * It probably makes sense to turn this into a lib with an ns form at
> the top. My experience has been that you need to do that to be able to
> debug, plus the niceness of use/require is convenient.
> * Have you tested char-seq and word-seq against really large inputs?
> The apparently recursive calls in there make me nervous: see [1].
> * Have you looked at clojure.contrib.seq-utils? It seems like
> partition-by might make your word-seq function obsolete.
> * There have been some comments on the clojure list recently around
> the difference between streams and seqs that might be relevant.
>
> [1]
> This function blows the heap when called via (count (down-from
> 20000000)). I would expect it to blow the *stack*, though: perhaps
> count retains the head somehow. Also, I don't think I understand
> exactly how lazy-cons works.
>
> (defn down-from [x]
>   (if (> x 0)
>     (lazy-cons x (down-from (dec x)))))
>
> On Sun, Dec 14, 2008 at 3:42 PM, Luke VanderHart
>
> <luke.vanderh...@gmail.com> wrote:
>
> > I have written a Markov Chain generator in Clojure. I'm really pleased
> > with how it turned out, given that it's the first non-trivial program
> > I've written in it. (If you don't know what a Markov Chain is, you can
> > read about it athttp://en.wikipedia.org/wiki/Markov_chain). It
> > supports chaining based on letters or on words - both yield
> > interesting results. Just run it on any sufficiently large corpus
> > text, and you'll get output that mimics the style. I've run it on
> > Plato, Mark Twain and Jane Austen, and it's really interesting to
> > watch the different styles represented through pure randomness/
> > statistics.
>
> > The first iteration took me 4-5 hours, and then I took about that much
> > time again and re-wrote it to be fully lazy and decreased the size of
> > the code base by 50%, and far more readable and elegant. It performs
> > better, too.
>
> > I love the language - as compared with Java, it is so much fun to
> > write. It takes more effort for me to reason about, but all the code I
> > write is absolutely part of the program logic - I hadn't realized how
> > much boilerplate Java had until I thought about the Java
> > implementation of it.
>
> > I've uploaded it to the newsgroup here:
> >http://clojure-study-dc.googlegroups.com/web/markov.clj?gsc=qogOMBYAA...

Keith Bennett

unread,
Dec 15, 2008, 3:40:55 PM12/15/08
to clojure-...@googlegroups.com
Guys -

I'm a newbie to Lisp and Clojure, but for what it's worth, I remember
while studying Erlang that they had a way to make recursive calls
nonrecursive. So the code looked like it was recursive, but under the
covers, that's not how it was implemented. Maybe Clojure's doing
something similar?

- Keith

Craig Andera

unread,
Dec 15, 2008, 4:15:24 PM12/15/08
to clojure-...@googlegroups.com
Unfortunately, the JVM does not support tail-call optimization (TCO),
which you need in order to make recursion not eat the stack. So
Clojure doesn't support it either. Instead, we get two things that are
reasonable replacements: recur for self-recursion and trampoline for
mutual recursion. Trampoline is pretty new - about a week old, IIRC.

Craig Andera

unread,
Dec 15, 2008, 4:17:30 PM12/15/08
to clojure-...@googlegroups.com
> Yes... I've loaded entire 500-page novels, with index sizes up to half
> a million entries. Also, the char-seq function is a slightly modified
> clone of the line-seq function in the Clojure standard lib. I am not
> 100% sure I understand it completely, but to the best of my knowledge,
> it works because lazy-cons is a macro. The nested call to char/word-
> seq is not called recursively, rather, the call to it actually creates
> a closure that is stored and then lazily called when required by the
> seq. So it doesn't add to the stack. I would expect that a count of 20
> million would blow the heap, unless you've assigned Java a lot of heap
> space.

OK, if you tested it, then that's cool. I did look at lazy-cons, and I
*think* I understand it, but it hasn't really sunk in. :) Your
explanation jibes with my understanding, however.

> I wasn't aware of the seq-utils library... It does seem as if the
> contrib library needs better documentation. I can't find anything
> about what's available except what people happen to mention on the
> newsgroup, short of reading the source itself.

Agreed. Which is why I make it a point to read the source. :) Since
Clojure is so concise, it's a worthwhile 5 minutes a few times a week
to peruse what's in the clojure-contrib tree. I should probably be
doing the same thing with the clojure sources themselves.

I think Stu's book is going to have a summary of useful functions in
it somewhere, too.

Reply all
Reply to author
Forward
0 new messages