performance question


Dmitri

Dec 13, 2008, 10:41:54 AM
to Clojure
I wrote a simple word counter described here: http://ptrace.fefe.de/wp/
It reads stdin and counts the occurrences of words; however, I notice
that it runs significantly slower than the Java version in the link.

I was wondering why there is such a dramatic difference. The approach
I took was to create a map keyed on words, with the occurrence count
as the value. As each line is read from input, it's tokenized and the
word counts are updated. The slowdown seems to occur in the inc-count
function, where it "updates" the map using assoc. Is this not a
proper way to approach this in Clojure?

I've also noticed that there is a significant speed difference between
conj and assoc -- why is that?
If I understand correctly, both should only create the delta between the
new elements and the old structure, yet assoc appears to perform much
better.

(import '(java.io BufferedReader InputStreamReader))

(defn inc-count [words word]
  (if (= (. word (length)) 0)
    words
    (let [cnt (get words word)]
      (if cnt
        (assoc words word (inc cnt))
        (assoc words word 1)))))

(defn sort-words [words]
  (reverse (sort-by (fn [x] (first x))
                    (map (fn [x] [(get words x) x])
                         (keys words)))))

(defn print-words [words]
  (let [head (first words) tail (rest words)]
    (if head
      (do
        (println head)
        (recur tail)))))

(defn read-words [words line]
  (let [head (first line) tail (rest line)]
    (if (nil? tail)
      words
      (recur (time (inc-count words head)) tail))))

(defn read-input []
  (with-open [stream System/in]
    (let [buf (BufferedReader. (InputStreamReader. stream))]
      (loop [line (. buf (readLine)) words {}]
        (if (nil? line)
          (print-words (sort-words words))
          (recur (. buf (readLine))
                 (read-words words (. line (split " ")))))))))

(time (read-input))

Michel Salim

Dec 13, 2008, 11:02:06 AM
to Clojure
On Dec 13, 10:41 am, Dmitri <dmitri.sotni...@gmail.com> wrote:
...
> If I understand correctly both should only create the delta of the new
> elements and the old structure, however assoc appears to perform much
> better.

It's doing that, but that is still less performant than a mutable map.
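To illustrate the gap, here is a rough comparison sketch (the function names and the sample word seq are mine, and timings are machine-dependent): the same counting loop over a persistent map and over a mutable java.util.HashMap.

```clojure
;; Persistent map: every update allocates a new (structure-sharing) map.
(defn count-persistent [ws]
  (reduce (fn [m w] (assoc m w (inc (get m w 0)))) {} ws))

;; Mutable map: updates happen in place via HashMap.put.
;; Note clojure.core/get also works on java.util.Map.
(defn count-mutable [ws]
  (let [m (java.util.HashMap.)]
    (doseq [w ws]
      (.put m w (inc (get m w 0))))
    m))

(let [ws (take 100000 (cycle ["the" "quick" "brown" "fox"]))]
  (time (count-persistent ws))
  (time (count-mutable ws))
  nil)
```

The mutable version will generally come out ahead on raw update throughput; the persistent one buys you safe sharing of old versions.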

Another thing you can try is skipping the reverse -- why not sort by
(fn [x] (- (first x)))?
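A quick check of the negation trick (the sample pairs here are mine):

```clojure
;; Sorting on the negated count gives descending order directly,
;; so the extra reverse pass over the sorted seq is unnecessary.
(def pairs [[3 "the"] [7 "and"] [1 "of"]])

(= (reverse (sort-by (fn [x] (first x)) pairs))
   (sort-by (fn [x] (- (first x))) pairs))
;; → true; both yield ([7 "and"] [3 "the"] [1 "of"])
```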

--
Michel

Jeremy Dunck

Dec 13, 2008, 12:38:32 PM
to Clojure


On Dec 13, 9:41 am, Dmitri <dmitri.sotni...@gmail.com> wrote:
...
> The slowdown seems to occur in the inc-count
> function, where it "updates" the map using the assoc. Is this not a
> proper way to approach this in clojure?

> (recur (time (inc-count words head)) tail))))

You're pretty tightly looping here-- are you sure the overhead isn't
in this extra (time) call rather than (inc-count) itself?

Dmitri

Dec 13, 2008, 1:12:29 PM
to Clojure
I added the time call later on to find what was taking up the cycles.
I also checked the reverse; its impact is minimal. The print-words
part of the program runs fast, but read-words takes the majority
of the time.

Chouser

Dec 13, 2008, 1:15:53 PM
to clo...@googlegroups.com
On Sat, Dec 13, 2008 at 10:41 AM, Dmitri <dmitri....@gmail.com> wrote:
>
> I wrote a simple word counter described here http://ptrace.fefe.de/wp/
> it reads stdin and counts the occurrences of words, however I notice
> that it runs significantly slower than the java version in the link.

There are several differences that could be factors. For example, the
Java version uses StreamTokenizer, while your Clojure version uses
String.split with a regex that gets recompiled for each line read.
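To illustrate that point, the pattern can be hoisted out so it is compiled once -- in Clojure a #"" regex literal is compiled at read time -- instead of handing String.split a fresh pattern string for every line (a sketch; the names are mine):

```clojure
;; space-pat is compiled once; calling .split on the Pattern object
;; reuses the compiled regex for every line.
(def space-pat #" ")

(defn tokenize [line]
  (seq (.split space-pat line)))

(tokenize "hello brave new world")
;; → ("hello" "brave" "new" "world")
```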

> I've also noticed that there is a significant speed difference between
> conj and assoc, why is that?
> If I understand correctly both should only create the delta of the new
> elements and the old structure, however assoc appears to perform much
> better.

user=> (let [c 1000000 p [1 1]]
         (time (reduce #(conj % [%2 %2]) {} (range c)))
         (time (reduce #(assoc % %2 %2) {} (range c)))
         nil)
"Elapsed time: 1544.180472 msecs"
"Elapsed time: 1894.318809 msecs"
nil
user=> (let [c 1000000 p [1 1]]
         (time (reduce #(conj % [%2 %2]) {} (range c)))
         (time (reduce #(assoc % %2 %2) {} (range c)))
         nil)
"Elapsed time: 1549.159812 msecs"
"Elapsed time: 1594.18912 msecs"

That's a million items added to a hash-map each way in about 1.5
seconds -- not too shabby. And the speeds for conj vs. assoc seem
very close, though I'm actually seeing a slight advantage for conj.

And I'm sorry for what follows -- it's like a compulsion for me, and I
hope it doesn't put you off. Each of these functions takes the same
input and produces the same output as your original code, but each is
implemented a bit more succinctly:

(import '(java.io BufferedReader InputStreamReader))

(defn inc-count [words word]
  (if (seq word)
    (assoc words word (inc (words word 0)))
    words))

(defn sort-words [words]
  (reverse (sort (map (fn [[k v]] [v k]) words))))

(defn print-words [words]
  (doseq [head words]
    (println head)))

(defn read-words [words line]
  (reduce inc-count words line))

(defn read-input []
  (with-open [buf (BufferedReader. (InputStreamReader. System/in))]
    (let [words (for [line (line-seq buf)] (.split line " "))]
      (print-words (sort-words (reduce read-words {} words))))))

(time (read-input))

--Chouser

Dmitri

Dec 13, 2008, 1:19:19 PM
to Clojure
To give an example, I tried running it over the Iliad from Project
Gutenberg; it's roughly 1 MB of text: http://www.gutenberg.org/files/6130/6130.txt

The program takes ~4600 ms to run; if I comment out the printing of
the results, it runs in ~3700 ms.
By contrast, the Java version runs in ~560 ms.

Now, obviously I could just use a mutable Java hash map, but I'm
curious whether there's a functional approach that would be efficient.
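One functional shape that trims some of the incidental cost while keeping the persistent map (a sketch -- the names are mine): a single reduce over all the words, with the split pattern compiled once rather than per line.

```clojure
;; One pass: split every line with a pattern compiled once, then
;; fold the words into a persistent map, skipping empty tokens.
(def space-pat #" ")

(defn count-words [lines]
  (reduce (fn [m w]
            (if (= w "") m (assoc m w (inc (get m w 0)))))
          {}
          (mapcat #(seq (.split space-pat %)) lines)))

(count-words ["the quick brown" "the fox"])
;; → {"the" 2, "quick" 1, "brown" 1, "fox" 1}
```

This stays purely functional; it just avoids rebuilding intermediate per-line state and regex recompilation.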


Dmitri

Dec 13, 2008, 1:21:45 PM
to Clojure
Thanks for pointing this out, and I absolutely appreciate the example.
I'm still new to the functional approach, and I always like to see how
things are done properly.


Justin Henzie

Dec 14, 2008, 12:47:06 PM
to Clojure
Nice code, Chouser -- always nice to see a succinct functional example.
