Parallel words frequency ranking

Piotr 'Qertoip' Włodarek

unread,

Dec 28, 2008, 10:50:35 PM12/28/08

to Clojure

Following my recent adventure with words ranking, here's the parallel
version:

(use 'clojure.contrib.duck-streams)

(defn top-words-core [s]
(reduce #(assoc %1 %2 (inc (%1 %2 0))) {}
(re-seq #"\w+"
(.toLowerCase s))))

(defn format-words [words]
(apply str
(map #(format "%20s : %5d \r\n" (key %) (val %))
(sort-by #(- (val %))
words))))

(defn split-string-in-two [s]
(let [chunk-size (quot (count s) 2)]
[(subs s 0 chunk-size), (subs s chunk-size)]))

(defn parallel-top-words [in-filepath out-filepath]
(let [string (slurp in-filepath)
agents (map #(agent %) (split-string-in-two string))]

(doseq [a agents] (send a top-words-core))
(apply await agents)

(spit out-filepath
(format-words
(apply merge-with + (map deref agents))))))

(http://pastie.org/348106)

On 38MB file it takes 28s, compared to 38s of similar but sequential
version.

1. Is there a better way to do it? Perhaps agents should share some
data structure?
2. Despite producing valid results, the program never ends. Why?

regards,
Piotrek

Chouser

unread,

Dec 29, 2008, 1:02:36 AM12/29/08

to clo...@googlegroups.com

On Sun, Dec 28, 2008 at 10:50 PM, Piotr 'Qertoip' Włodarek
<qer...@gmail.com> wrote:
>
> Following my recent adventure with words ranking, here's the parallel
> version:
>
> (use 'clojure.contrib.duck-streams)

Thanks for including your 'use' line -- that's so much better than
leaving it implied. Please also consider using 'require' instead, or
the :only option to 'use' to make it clear which functions are being
brought in from which lib.

> (defn split-string-in-two [s]
> (let [chunk-size (quot (count s) 2)]
> [(subs s 0 chunk-size), (subs s chunk-size)]))

Might this cut a word in half and produce (slightly) incorrect
results?

> 1. Is there a better way to do it? Perhaps agents should share some
> data structure?

I've got lots to learn in the realm of parallel program design, but
the sequential summing step at the end stood out to me. Perhaps
another agent that just does incremental summing along the way would
reduce the running time.

Also, the entire file is read in by a single thread -- perhaps the use
of mmap would allow the agents to start counting sooner and reduce
total run time that way. You may want to look at clojure.contrib.mmap

> 2. Despite producing valid results, the program never ends. Why?

When a program uses agents, the top-level of the program is
responsible for determining that all the agents have done what they
need to, and that the program can terminate. At the REPL this can
generally be done with Ctrl-D. In a Script, use (shutdown-agents). I
imagine this ought to be done by the same level of code that calls
parallel-top-words, since p-t-w itself can't know that no other agent
work is being done.

--Chouser

Mark H.

unread,

Dec 30, 2008, 12:54:48 AM12/30/08

to Clojure

On Dec 28, 7:50 pm, Piotr 'Qertoip' Włodarek <qert...@gmail.com>
wrote:

> Following my recent adventure with words ranking, here's the parallel
> version:
>

> ...

> (defn parallel-top-words [in-filepath out-filepath]
> (let [string (slurp in-filepath)

'slurp' just reads the whole file in at once as a string, right? You
could be bottlenecking on the file access and read. Can you wrap this
step only with a timer and see how long it takes?

mfh

Piotr 'Qertoip' Włodarek

unread,

Dec 30, 2008, 8:56:14 AM12/30/08

to Clojure

On Dec 29, 7:02 am, Chouser <chou...@gmail.com> wrote:
> > (defn split-string-in-two [s]
> > (let [chunk-size (quot (count s) 2)]
> > [(subs s 0 chunk-size), (subs s chunk-size)]))
>
> Might this cut a word in half and produce (slightly) incorrect
> results?

True, I decided to let it be for the sake of simplicity.

> > 1. Is there a better way to do it? Perhaps agents should share some
> > data structure?
>
> I've got lots to learn in the realm of parallel program design, but
> the sequential summing step at the end stood out to me. Perhaps
> another agent that just does incremental summing along the way would
> reduce the running time.
>
> Also, the entire file is read in by a single thread -- perhaps the use
> of mmap would allow the agents to start counting sooner and reduce
> total run time that way. You may want to look at clojure.contrib.mmap

Thanks for suggestion.

In fact, mmap/slurp is *much* faster than built-in slurp.

Now, my program starts with:

(ns com.wlodarek.examples
(:require [clojure.contrib.mmap])
(:require [clojure.contrib.duck-streams]))

(defn read-file [filepath]
(str (clojure.contrib.mmap/slurp filepath)))

(defn write-file [filepath string]
(clojure.contrib.duck-streams/spit filepath string))

; ...

I hope Clojure will soon have fast, reasonably named read/write file
functions built in the core.

Now I'm trying to parallelize also the formatting part, and to move
from 2 agents to n agents.

Again, thanks for all tips!

regards,
Piotrek

Mibu

unread,

Dec 30, 2008, 12:18:07 PM12/30/08

to Clojure

In an ideal world, standard functions like map, sort, reduce, filter,
etc. would know when to parallelize on their own, or even better, the
compiler will do it for them.

Meanwhile, I tried playing with the parallel lib ( http://clojure.org/other_libraries
) which has functions like preduce, psort, and par:map functions, but
I wasn't very successful. Anyone who knows what s/he's doing can shed
some light on how to use it with this example?

Mibu

P.S. psort and sort's arguments are inconsistent (psort [coll comp]
vs. sort [comp coll]).

On Dec 30, 3:56 pm, Piotr 'Qertoip' Włodarek <qert...@gmail.com>
wrote:

Mark H.

unread,

Dec 30, 2008, 4:24:23 PM12/30/08

to Clojure

On Dec 30, 9:18 am, Mibu <mibu.cloj...@gmail.com> wrote:
> In an ideal world, standard functions like map, sort, reduce, filter,
> etc. would know when to parallelize on their own, or even better, the
> compiler will do it for them.

The former is easier than the latter ;-) Even the smartest
autoparallelizing compilers rely on manual annotations to expose lack
of sequential dependencies in loops, but a function's API can
guarantee that parallelism and let the function's implementation do
the heavy lifting.

It takes so much effort to write a "good" (read "aggressively
optimizing") compiler that major releases come out on the same
schedule as major CPU releases, which means users don't get time to
work out all the bugs on the current (hardware,software) combination
before the next compiler and CPU come out. Besides, compiler bugs are
incredibly frustrating for users (trust me). This means it's better
to have simpler compilers and smarter libraries.

mfh

Mark H.

unread,

Dec 30, 2008, 4:35:08 PM12/30/08

to Clojure

On Dec 28, 7:50 pm, Piotr 'Qertoip' Włodarek <qert...@gmail.com>
wrote:

> On 38MB file it takes 28s, compared to 38s of similar but sequential
> version.

Another good thing is to make a simple performance model, perhaps
aided by timings of individual components, before you start
parallelizing everything ;-) How long does it take just to read in
the file (with or without mmap)? How long does it take one agent to
finish its count? How long does the reduction over agents' counts at
the end take? If you're concerned about the latter being a sequential
sum, how does its performance vary as a function of the number of
agents?

The total performance of your implementation should be

(time for slurp) + (time for one agent to count its words) + (time for
reduction over agents),

not counting the time to start up the agents. Unless you have a RAID
array or a serious parallel filesystem, (time for slurp) is just a
fixed overhead you have to eat. btw how many processors do you have?

mfh

Konrad Hinsen

unread,

Dec 31, 2008, 6:03:56 AM12/31/08

to clo...@googlegroups.com

On 30.12.2008, at 22:24, Mark H. wrote:

> On Dec 30, 9:18 am, Mibu <mibu.cloj...@gmail.com> wrote:
>> In an ideal world, standard functions like map, sort, reduce, filter,
>> etc. would know when to parallelize on their own, or even better, the
>> compiler will do it for them.
>
> The former is easier than the latter ;-) Even the smartest
> autoparallelizing compilers rely on manual annotations to expose lack
> of sequential dependencies in loops, but a function's API can
> guarantee that parallelism and let the function's implementation do
> the heavy lifting.

Manual intervention is required not only for the identification of
dependencies, but also for telling the compiler where parallelization
makes sense for gaining efficiency. It is often said that pure
functional programs have the advantage of being autoparallelizable
because there are no hidden dependencies, but there still isn't any
decent autoparallelizing compiler for any functional language at the
moment. One of the reasons is that performance analysis is still very
difficult. Perhaps one day we will have parallel JIT compilers for
this, but that's not for tomorrow.

> incredibly frustrating for users (trust me). This means it's better
> to have simpler compilers and smarter libraries.