--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
Wouldn't the following be close enough to what you want?
(defn email-approved [approved]
  (doall (pmap email-request approved)))
--
Sean A Corfield -- (904) 302-SEAN
An Architect's View -- http://corfield.org/
World Singles, LLC. -- http://worldsingles.com/
Railo Technologies, Inc. -- http://www.getrailo.com/
"Perfection is the enemy of the good."
-- Gustave Flaubert, French realist novelist (1821-1880)
I often launch parallel threads with pmap and have sometimes been puzzled by dips in processor utilization that I can't trace to memory resource contention, etc.
I have similar issues sometimes when I launch parallel threads via sends to agents. Will this behave similarly to pmap? If so, is there a straightforward way to get the same kind of benefit as medusa-pmap in an agent context?
-Lee
What I want is a version of pmap that will always use any available cores to compute remaining values (except, of course, for the last few values, when there are fewer remaining values than cores).
In other words, I want the behavior that Andy Fingerhut describes medusa as having here:
> On Sep 22, 2011, at 11:34 PM, Andy Fingerhut wrote:
>
> > pmap will limit the maximum number of simultaneous threads. So will the medusa library's medusa-pmap.
> >
> > The difference is that if one job early in the list takes, e.g., 100 times longer than the next 20, and you have 4 cpus available, pmap will start the first (4+2)=6 in parallel threads, let jobs 2 through 6 complete, and then wait for the first one to finish before starting number 7. Thus most of the time will be spent running one thread. This has the advantage of limiting the memory required to store intermediate results, but the disadvantage of low usage of multiple cpu cores.
> >
> > Medusa has a medusa-pmap that will also limit the parallelism, but it lets you pick the level of parallelism (it isn't fixed at (# cpus + 2)), and it will continue starting new threads, even if that means using more memory when one job takes much longer than the following jobs.
> >
FWIW I'll always be calling this on a finite sequence (although it may have 10000 elements or so, so I presume that I shouldn't start threads for all of them at once), and I will want a fully realized result sequence so I'll surround calls with doall if necessary. I will want all of the pmapped computations to finish before I continue doing anything else.
I know that I could launch the threads in other ways -- e.g. I sometimes use agents for something similar -- but pmap is much more elegant in many cases, and anyway I'm not even sure I'm getting full utilization with my other methods.
The medusa-pmap function is very close to what I want, but medusa seems to do more than what I need: it requires init/startup calls, it involves timeouts which I will never want, and it behaves strangely when I run my code on a 48-core node. (It runs fine on my 4-core laptop, and it seems to work beautifully on the 48-core node too for a while, giving me nearly 4800% utilization, but then it does something that looks like it might be caused by everything hitting the timeouts... which I bumped up by several orders of magnitude, but I'm still getting the weird behavior.)
So: Is there a simpler way to get "pmap but not lazy and use all cores fully until the whole result sequence is completed"? This is usually what I really want when I'm writing code that utilizes concurrency.
I've looked at the pmap source but it isn't obvious to me how to change that to do what I want... So any pointers would be appreciated.
Thanks,
-Lee
But I'm not sure if it's the best approach and I'd like some feedback. If it *is* a good approach then maybe we should refine it and make it more widely available. If it's not the best approach then I'd love some advice on how to do it better.
The use case here -- which I think must be shared by at least some others -- is that I have a finite, non-lazy collection of inputs on which I'd like to run an expensive (but not uniformly expensive) function, gathering the results in a non-lazy sequence. This is a *very* common need in my own projects. I don't care about the order in which the function calls are made, and I'm not concerned about the memory overhead of retaining all of the results since I want to keep them all anyway. I just want all of the computations done as quickly as possible, using all of my available cores. The pmap function seems at first to provide an elegant way to do what's needed, e.g. with (doall (pmap f inputs)), but as discussed earlier in this thread it will often wait for the completion of earlier calls before starting later calls, and this will be particularly problematic when the runtimes of the calls are uneven.
The medusa package provides something that would seem to fit the bill better, but it comes with other baggage that I don't need or want. I just want a version of pmap that will use all available CPU resources to aggressively complete all of the computations in a non-lazy context.
Here's my stab at doing this using agents:
(defn pmapall
  "Like pmap but: 1) coll should be finite, 2) the returned sequence
  will not be lazy, 3) calls to f may occur in any order, to maximize
  multicore processor utilization, and 4) takes only one coll so far."
  [f coll]
  (let [agents (map agent coll)]
    (dorun (map #(send % f) agents))
    (apply await agents)
    (doall (map deref agents))))
I should make a version that takes multiple colls, but for now I've written it to take just one for clarity.
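For illustration, a multiple-coll variant could probably be layered on top of the single-coll version by zipping the colls into argument lists first (a sketch only, not benchmarked; it keeps the same agent machinery):

```clojure
;; Sketch of a multiple-coll variant: zip the colls into argument
;; lists, then apply f to each list via the single-coll case.
(defn pmapall
  "Like pmap but eager, unordered in execution, and non-lazy in
  its result. Accepts one or more colls."
  ([f coll]
   (let [agents (map agent coll)]
     (dorun (map #(send % f) agents))
     (apply await agents)
     (doall (map deref agents))))
  ([f coll & colls]
   (pmapall #(apply f %) (apply map list coll colls))))
```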
This does appear to speed things up pretty significantly in certain circumstances, but maybe not as much as possible. Is it the best approach?
To show that it beats pmap I define a time-wasting function (I want to see real cpu utilization so I'm not using delays) like this:
(defn burn
  []
  (dotimes [i 10000]
    (reduce * (map float (take 1000 (iterate inc i))))))
And then I define a function that takes a lot or a little bit of time depending on its argument:
(defn fast-or-slow
  [n]
  (if (zero? n)
    :done
    (do (burn)
        :done)))
And then I create a vector of inputs in which the slow ones are scattered sparsely:
(def inputs (take 1000 (cycle (conj (repeat 20 0) 1))))
And then on a 48 core node I get timings like this:
user=> (time (last (pmapall fast-or-slow inputs)))
"Elapsed time: 37244.151 msecs"
:done
user=> (time (last (doall (pmap fast-or-slow inputs))))
"Elapsed time: 110862.187 msecs"
:done
And by the way, plain old serial map does this:
user=> (time (last (doall (map fast-or-slow inputs))))
"Elapsed time: 260941.836 msecs"
So we've improved things; pmap is a little better than twice as fast as map, and pmapall is roughly 3 times faster than pmap. So I think I'm ready to switch all of my pmaps to pmapalls. But that's still nothing close to a 48x speedup, even though all of the tasks should be completely independent and I wouldn't expect a huge loss for coordination. And another confusing thing is that even with pmapall, when I look at the CPU utilization I see numbers close to 4800% in some cases (like the one above) but numbers topping out at something more like 1900% in some others (e.g. with different input vectors).
So I feel like I'm moving in the right direction but that I'm still probably missing something.
Is there a better way to do this? Surely it will come in handy for others as well if there's a simple way to more effectively dispatch tasks to multiple cores.
Thoughts? Code?
Thanks,
-Lee
#<Exception java.lang.Exception: Can't await in agent action>
Which means, I think, that I can't call pmapall within a function that I pass to pmapall. Unfortunate.
Is there a better way?
-Lee
PS: to see these exceptions one must replace the call to agent in my definition with something like #(agent % :error-handler (fn [agnt except] (println except))).
--
Lee Spector, Professor of Computer Science
Cognitive Science, Hampshire College
893 West Street, Amherst, MA 01002-3359
lspe...@hampshire.edu, http://hampshire.edu/lspector/
Phone: 413-559-5352, Fax: 413-559-5438
What you're really looking for is pdoseq, right? Seems like futures
might be a better building-block for this, although again Clojure's
lack of flexibility over the thread pool could easily bite you here.
-Phil
No -- I want all of the returned values, as map/pmap provides but doseq does not. I want what pmap does except that I want greedier use of available cores until all of the values are computed, at the expense of computing them in order.
-Lee
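For what it's worth, the futures-plus-fixed-pool idea could be sketched like this (my own illustration of the suggestion, not code from this thread; pmap-via-pool is a made-up name, and note that nesting calls on a single shared bounded pool could deadlock, which a per-call pool avoids at the cost of extra threads):

```clojure
(import '(java.util.concurrent Executors ExecutorService Callable Future))

;; Sketch: submit every task to a fixed-size pool up front, so any
;; idle worker can pick up any pending task; then block for all
;; results, returned in input order.
(defn pmap-via-pool
  [f coll]
  (let [n    (.availableProcessors (Runtime/getRuntime))
        pool (Executors/newFixedThreadPool n)
        futs (doall (map (fn [x]
                           (.submit ^ExecutorService pool
                                    ^Callable (fn [] (f x))))
                         coll))]
    (try
      (doall (map (fn [^Future fut] (.get fut)) futs))
      (finally (.shutdown pool)))))
```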
Does your pmap-pool permit nesting? (That is, does it permit passing pmap-pool a function which itself calls pmap-pool?). If so then that would be a reason to prefer it over my pmapall.
Thanks,
-Lee
Hi Andy,
> pmap will limit the maximum number of simultaneous threads. So will
> the medusa library's medusa-pmap.
>
> The difference is that if one job early in the list takes, e.g., 100
> times longer than the next 20, and you have 4 cpus available, pmap
> will start the first (4+2)=6 in parallel threads, let jobs 2 through 6
> complete, and then wait for the first one to finish before starting
> number 7. Thus most of the time will be spent running one thread.
Wow, that would render pmap pretty useless, IMO. I've just checked the
code, but I cannot grasp it completely. Could you please explain a bit,
or tell me where I'm wrong below?
(I think I found the answer myself, so feel free to jump forward to the "AAAHHHH!!!".)
Here's the definition:
--8<---------------cut here---------------start------------->8---
(defn pmap
([f coll]
(let [n (+ 2 (.. Runtime getRuntime availableProcessors))
rets (map #(future (f %)) coll) ;; (1)
step (fn step [[x & xs :as vs] fs] ;; (2)
(lazy-seq
(if-let [s (seq fs)]
(cons (deref x) (step xs (rest s))) ;; (3)
(map deref vs))))]
(step rets (drop n rets))))
;; Overloaded variant stripped
)
--8<---------------cut here---------------end--------------->8---
Because rets is a lazy seq, nothing in it is actually realized in (1),
so that means that no future is created and thus no new thread is
started there.
The `step' function (2) always dereferences the first element of the
first seq (3), thus here is the place where the future is actually
created by submitting to some thread pool. But dereferencing also
blocks until the result is calculated (i.e., calls future.get()). So it
looks to me as if there's nothing parallel at all.
Of course, I must be wrong! But where???
Hm, well, the destructuring in the arglist of `step' probably realizes `x' (and maybe creates a future for the first element in `xs'). So then we'd have at most two active threads, but we'd still wait before starting the next one...
AAAHHHH!!! I think, I got it! It's the `drop', isn't it? To drop n
elements, you have to call `rest' n times, and because that calls `seq',
actually the first n elements are realized, that is, the tasks are
submitted to the thread pool.
So indeed, one starts with the number of available processors + 2, and
one single longer running task will wipe out any parallelism. :-(
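The realization effect can be seen directly at a REPL with a side-effecting lazy seq (a small illustration; iterate is used because, unlike range, it isn't chunked):

```clojure
;; Each element records itself in `log` when it is realized.
(def log (atom []))
(def effects (map #(do (swap! log conj %) %) (iterate inc 0)))

;; Producing the first element after `drop` also realizes the
;; dropped prefix -- the same way pmap's (drop n rets) realizes
;; (and therefore submits) the first n futures.
(first (drop 3 effects)) ;; => 3
(deref log)              ;; => [0 1 2 3]
```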
IMO, that's not what 99% of the users would expect nor want when calling
(pmap foo coll). I'd vote for making pmap eager in the sense that it
should always try to keep n threads working as long as more tasks are
available. Clearly, the current implementation is useful when your
tasks are equally expensive, i.e., the costs don't depend on the
argument.
Bye,
Tassilo
I can imagine cases in which one *would* want the current behavior of pmap, especially with infinite lazy sequences, so I wouldn't go quite as far as you are going here.
But I agree that many users will often want an eager version that maximizes CPU utilization, and that they may expect pmap to do that. In fact that was what I wanted (and will usually want), and it's what I assumed pmap would do (even though, now that I look afresh at the doc string, it pretty clearly says that it *doesn't* do this).
So my hope wouldn't be that we change pmap but rather that we provide something else that is just as simple to use but provides eager, "use all available cores to get the job done as fast as possible" behavior with a simple pmap-like interface.
My agent-based pmapall probably isn't the best approach, since calls to it can't be nested. Perhaps j-g-faustus's pmap-pool approach is the way to go, but I do not understand it well enough to know.
In any event I would hope that we could provide something like this, because I do think it will fill a common need.
-Lee
Hi Lee,
>> So indeed, one starts with the number of available processors + 2,
>> and one single longer running task will wipe out any parallelism. :-(
>>
>> IMO, that's not what 99% of the users would expect nor want when
>> calling (pmap foo coll). I'd vote for making pmap eager in the sense
>> that it should always try to keep n threads working as long as more
>> tasks are available. Clearly, the current implementation is useful
>> when your tasks are equally expensive, i.e., the costs don't depend
>> on the argument.
>
> I can imagine cases in which one *would* want the current behavior of
> pmap, especially with infinite lazy sequences, so I wouldn't go quite
> as far as you are going here.
You are right, I didn't think about infinite seqs. So pmap should be
lazy in order to terminate at all and to be a drop-in replacement for
map. But isn't there a way to always keep n submissions to the thread
pool ahead from the actual realization? Of course, that would mean that
in the case when you don't realize all elements, you have calculated n
elements too much.
> But I agree that many users will often want an eager version that
> maximizes CPU utilization, and that they may expect pmap to do that.
> In fact that was what I wanted (and will usually want), and it's what
> I assumed pmap would do
Agreed. Now you've convinced me that pmap shouldn't be eager, but there
should be an eager version.
> (even though, now that I look afresh at the doc string, it pretty
> clearly says that it *doesn't* do this).
Sort of:
Semi-lazy in that the parallel computation stays ahead of the
consumption, but doesn't realize the entire result unless required.
If I understand the code correctly (see my last mail), then the part about the "parallel computation stays ahead of the consumption" is not true. It starts parallel but converges to sequential evaluation.
> So my hope wouldn't be that we change pmap but rather that we provide
> something else that is just as simple to use but provides eager, "use
> all available cores to get the job done as fast as possible" behavior
> with a simple pmap-like interface.
Yes.
BTW: Am I the only one who sometimes would also like to have an eager
sequential map variant? At the weekend I've written a proof-of-concept
evaluator that gets some syntax graph of some query language and some
graph on which the query should be evaluated. The evaluation model is
simple syntax-driven recursive evaluation, i.e., to calculate the result
of some node in the syntax graph, one simply evaluates the children and
combines the results. There are also nodes that declare variables with
value ranges, where the child subgraph is then evaluated once for each
possible variable binding. That felt very natural to implement using a
^:dynamic hash-map *vars* holding the current binding in that scope and
`binding' to change the current one. Something along the lines of:
(for [b bindings]
  (binding [*vars* (merge *vars* b)]
    ;; evaluate the children
    ))
However, one has to force realization of any lazy seq in order to be
sure that the calculation is performed in the dynamic extent of the
surrounding variable binding. So in the sketch above, there's a `doall'
around the `for'.
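Concretely, the forced version looks like this (with a hypothetical *vars* and a stand-in function for the child evaluation):

```clojure
(def ^:dynamic *vars* {})

;; Without the doall, the seq produced by `for` could be realized
;; after the enclosing `binding` frame has been popped, so the
;; children would see the wrong *vars*.
(defn eval-under-bindings
  [bindings eval-children]
  (doall
   (for [b bindings]
     (binding [*vars* (merge *vars* b)]
       (eval-children)))))
```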
Bye,
Tassilo
On Oct 11, 2011, at 2:07 PM, Andy Fingerhut wrote:
> One benefit would be convenience of enabling parallelism on nested data structures. One function at the top level could use parallelism, and the pieces, perhaps handled by separate functions, and perhaps nested several levels deep in function calls, could also use parallelism.
Or consider the following scenario (which is exactly what I was doing :-): You want to produce the next generation of programs in an evolutionary computation system from n parents, each of which may produce m offspring, with n much larger than your number of cores (c). The runtimes for offspring production may vary widely. So you do something like (pmapall reproduce population) to maximize processor utilization among the parents, but late in the process, when the number of parents who are still busy making offspring is less than c, your cores start to go idle. Suppose you have just a few remaining parents who are much slower than the others, and each has to chug through m independent birth processes with cores sitting idle. If you could nest calls to pmapall then those slowpokes would be farming some of their m births out to the available cores.
In this situation, at least, I don't care about the order in which the (* n m) independent tasks are completed. I just want them all done, with as little wasted CPU capacity as possible. And I don't want to somehow factor out all of the births into one long queue that I pass to a single call to pmapall. The natural decomposition of the problem is to say something like (pmapall reproduce population) at the population level, something like (pmapall (fn [_] (make-baby parent)) (range litter-size)) at the parent level, and ask the platform to make the calls as aggressively as possible, reusing cores for pending calls whenever possible.
Calls to pmap can be nested and I *think* that they will actually do some of this, if the sequencing of fast vs slow things is fortuitous. For example, I *think* that in the some of the situations that have been described here, where some cores are idle waiting for a slow, early computation in a sequence to complete, a pmap from within the slow, early computation will use some of those available cores. I'm not certain of this, however! And in any event it's not ideal for the reasons that have been raised.
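One way to get this nesting-friendly behavior is a work-stealing pool. Here is a sketch using Java 7's ForkJoinPool (in 2011 this would have meant the jsr166y preview jar; pmapall-fj is a made-up name, and this is an illustration rather than anyone's posted code):

```clojure
(import '(java.util.concurrent Callable ForkJoinPool ForkJoinTask))

(def ^ForkJoinPool fj-pool (ForkJoinPool.))

;; Tasks forked from inside a ForkJoinTask stay on the same
;; work-stealing pool, so nested calls share the cores instead of
;; leaving them idle while a slow outer task runs.
(defn pmapall-fj
  [f coll]
  (letfn [(run []
            (let [tasks (doall (map #(ForkJoinTask/adapt
                                      ^Callable (fn [] (f %)))
                                    coll))]
              (doseq [^ForkJoinTask t tasks] (.fork t))
              (doall (map (fn [^ForkJoinTask t] (.join t)) tasks))))]
    (if (ForkJoinTask/inForkJoinPool)
      (run)
      (.invoke fj-pool ^ForkJoinTask (ForkJoinTask/adapt ^Callable run)))))
```

Nested calls, e.g. (pmapall-fj (fn [n] (reduce + (pmapall-fj inc (range n)))) parents), then hand their subtasks to whichever workers are free.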
-Lee
> I haven't read the entire thread carefully, but have you considered
> the "work" library <https://github.com/getwoven/work> as a potential
> fit for what you are trying to do?
I hadn't heard of it, but it does look promising and I will check it out. Thanks!
-Lee