multiple agents yielding flat performance

32 views
Skip to first unread message

tmountain

unread,
Jun 15, 2009, 6:22:33 PM6/15/09
to Clojure
Having learned about agents recently, I've created a pretty contrived
program, which I believed would easily lend itself to parallel
processing. The program takes a list of MD5 sums and then does brute
force comparison the find the corresponding four character strings
they were generated from. The brute force character generation and
MD5'ing are CPU intensive, so I assumed I'd see a big performance
improvement by spawning one agent per CPU and then dividing the
workload evenly across them.

I'm testing on a quad-core box, and I'm seeing virtually identical
performance numbers switching between one agent and four agents. I've
also tried larger numbers like 12 agents, but the results are the
same. The load average also stays the same hovering around 1.0, and
the execution time is static at approximately 15 seconds. To make sure
Clojure, Java, and agents were at least working, I also tested a very
small program to just do an infinite loop per agent, and it instantly
maxed out all four CPUs, so I know agents work ;-).

(ns agentmd5
(:refer-clojure)
(:import
(java.security
NoSuchAlgorithmException
MessageDigest)
(java.math BigInteger)))

; computes an md5sum of a string
(defn md5-sum
"Compute the hex MD5 sum of a string."
[#^String str]
(let [alg (doto (MessageDigest/getInstance "MD5")
(.reset)
(.update (.getBytes str)))]
(try
(.toString (new BigInteger 1 (.digest alg)) 16)
(catch NoSuchAlgorithmException e
(throw (new RuntimeException e))))))

; starts at "aaaa" and rotates string based on `n'
; "aaaa", "aaab", "aaac", etc...
(defn base26 [n]
(let [seed-string "aaaa"
s (new StringBuilder seed-string)
a_val (int \a)]
(loop [pos 3 x (int n)]
(when (pos? x)
(let [digit (char (+ a_val (unchecked-remainder x
26)))]
(.setCharAt s pos digit)
(when (pos? pos)
(recur (int (dec pos)) (unchecked-divide x
26))))))
(.toString s)))

; cycles through character combinations for a given md5 string
; until the matching result is found
(defn decode-md5 [md5]
(loop [x 0]
(let [gen-string (base26 x)]
(if (= md5 (md5-sum gen-string))
gen-string
(if (< x 456976) ; 26^4
(recur (inc x)))))))

; decodes a bucket
(defn decode [bucket]
(map decode-md5 bucket))

; returns a collection of agents for a given set of work buckets
(defn spawn-agents [agents buckets]
(if (empty? buckets)
agents
(recur (conj agents (agent (first buckets)))
(rest buckets))))

; number of tasks to doll out to the agents
(def num-work-units 60)

; generate requested number of work units
(def work-units (map md5-sum (for [x (range num-work-units)]
(base26 50000)))) ; (base26 50000)
== "cvzc"

; number of agents to spawn
(def num-agents 4)

; divide the units into buckets for each agent
(def work-buckets (partition (int (/ (count work-units)
num-agents)) work-
units))

; create an agent for each bucket
(def agents (spawn-agents [] work-buckets))

; send each agent a job to do
(doseq [agent agents]
(send agent decode))

; ensure successful dispatch
(apply await agents)

; view the results
(doseq [agent agents]
(doseq [result @agent]
(println result)))

; clean up
(shutdown-agents)

Richard Newman

unread,
Jun 15, 2009, 6:36:46 PM6/15/09
to clo...@googlegroups.com
> I'm testing on a quad-core box, and I'm seeing virtually identical
> performance numbers switching between one agent and four agents. I've
> also tried larger numbers like 12 agents, but the results are the
> same. The load average also stays the same hovering around 1.0, and
> the execution time is static at approximately 15 seconds.

It looks to me like your work unit generation is lazy. I also surmise
that it's taking 15 seconds to do 50,000 MD5 computations == 0.3ms per
computation, and that might include the rest of your program, too.

Is it possible that the lazy computation of the work unit -- which is
not parallelized -- is only running fast enough to supply one agent
with input?

Have you tried completely materializing your work units before timing
the agent part?

tmountain

unread,
Jun 15, 2009, 6:56:24 PM6/15/09
to Clojure
I've tried this in lieu of the way I was generating the work units.

(def work-units (for [x (range 100)]
"88148433eeb5d372c0e352e38ac39aca"))

I know that for is still lazy, so I did this after the binding of work-
buckets:

(println work-buckets) ; yielded expected result
((88148433eeb5d372c0e352e38ac39aca
88148433eeb5d372c0e352e38ac39aca....

I'm assuming if I can see work-buckets from the print statement that
it's fully materialized. This part all happened in less than a second.
From here, all that's left to do is spawn four agents (assigning each
a work bucket), send them each a decode function, and the loop over
the result set, and I'm seeing the same results as before.

Richard Newman

unread,
Jun 15, 2009, 7:06:45 PM6/15/09
to clo...@googlegroups.com

tmountain

unread,
Jun 15, 2009, 8:22:34 PM6/15/09
to Clojure
Completely omitting work-buckets and spawn-agents, I've replaced with
the following, but the CPU still sits at 100% usage, and the run time
is still ~15 seconds.

(def work-units (doall (for [x (range 15)]
"88148433eeb5d372c0e352e38ac39aca")))
(def agents [(agent work-units)
(agent work-units)
(agent work-units)
(agent work-units)])

; send each agent a job to do
(doseq [agent agents]
(send agent decode))

; wait on the agents to complete their jobs
(apply await agents)

; see the results
(doseq [agent agents]
(doseq [result @agent]
(println result)))

; clean up
(shutdown-agents)

On Jun 15, 7:06 pm, Richard Newman <holyg...@gmail.com> wrote:
> Try doall:
>
> http://clojure.org/api#toc216

tmountain

unread,
Jun 15, 2009, 8:35:42 PM6/15/09
to Clojure
Nevermind, kindly ignore my last post ;-) You called it. The map
inside my decode function returns a lazy seq, and it was being
accessed on-demand by the doseq towards the end of the program. To
make matters worse, I was consuming the agents in a serial fashion
completely eliminating any parallel processing. After the change, it
runs in 5 seconds on four cores as opposed to 15 seconds on a single
core. Thank you for taking the time to help me with this. It's been a
learning experience.

Travis

Richard Newman

unread,
Jun 15, 2009, 8:38:23 PM6/15/09
to clo...@googlegroups.com
> After the change, it runs in 5 seconds on four cores as opposed to
> 15 seconds on a single
> core. Thank you for taking the time to help me with this. It's been a
> learning experience.

Great news! Happy to help. This stuff is pretty new to me, too :)

Reply all
Reply to author
Forward
0 new messages