doseq vs dorun


Pradeep Gollakota

Oct 16, 2013, 10:34:18 PM
to clo...@googlegroups.com

Hi All,

I’m (very) new to clojure (and loving it)… and I’m trying to wrap my head around how to correctly choose doseq vs dorun for my particular use case. I’ve read this earlier post https://groups.google.com/forum/#!msg/clojure/8ebJsllH8UY/mXtixH3CRRsJ and I had a clarifying question.

From what I gathered in the above post, it’s more efficient to use doseq instead of dorun since map creates another seq. However, if the fn you want to apply on the seq can be parallelized, doseq wouldn’t give you the ability to parallelize. With dorun you can use pmap instead of map and get parallelization.

(doseq [i some-lazy-seq] (side-effect-fn i))
(dorun (pmap side-effect-fn some-lazy-seq))

What is the idiomatic way of parallelizing a computation on a lazy seq?

Thanks,
Pradeep

Cedric Greevey

Oct 17, 2013, 2:53:49 AM
to clo...@googlegroups.com
Ideally, you wouldn't be using a side effect at all, but something like reducers to return a single computed result after going over the sequence. (If the input's too big for main memory, you'd also need to partition the input seq into reducible-collection chunks small enough to fit in memory.)
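
For illustration, a minimal sketch of that approach, assuming the input already fits in memory as a vector (parallel-sum and f are just placeholder names):

(require '[clojure.core.reducers :as r])

;; r/map builds no intermediate seq; r/fold reduces in parallel via
;; fork/join because the underlying collection is a vector.
(defn parallel-sum [f input-vector]
  (r/fold + (r/map f input-vector)))

;; e.g. (parallel-sum #(* % %) (vec (range 1000000)))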

If side effects are necessary because you're doing I/O for each element of the seq, then the overhead of wrapping in pmap is probably minimal as the task is I/O-bound, but the benefit of pmap may not be significant either. Threaded I/O is generally only useful for:
1. preventing I/O from bottlenecking a CPU-bound task, by splitting the two into separate threads; and
2. networking with many remote hosts, so you can usefully do something with host B while waiting for a response from host A, or with one remote host where latency and task orthogonality make several parallel interactions preferable to several sequential ones (e.g. a web browser loading images several at a time from a web server when the throughput is high but so is the latency).

If side effects are necessary because you're interacting with a legacy Java API that uses mutable state, you might want to look into pvalues and pcalls.
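
For reference, both evaluate independent pieces of work in parallel and return a lazy seq of the results; a toy example:

;; pvalues takes expressions, pcalls takes no-argument functions.
(pvalues (reduce + (range 1000000)) (reduce * (range 1 21)))
(pcalls #(reduce + (range 1000000)) #(reduce * (range 1 21)))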



Stefan Kamphausen

Oct 17, 2013, 3:24:50 AM
to clo...@googlegroups.com
Hi,


What is the idiomatic way of parallelizing a computation on a lazy seq?


keep in mind that pmap processes the seq semi-lazily, with a moving window whose size depends on the number of cores available on your machine. If processing one element takes a long time, the parallel work will wait for it to finish before moving on. Thus, pmap is an easy way to get parallel processing, but it is only well suited to workloads where each element takes approximately the same time to process.
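
A rough way to see that behaviour (a toy example; timings depend on how many cores you have, since the window is roughly two more than the number of cores):

(defn slow-inc [ms x]
  (Thread/sleep ms)
  (inc x))

;; All elements take the same time: the work spreads evenly over the window.
(time (dorun (pmap (partial slow-inc 100) (range 32))))

;; One element is much slower: elements beyond the current window are not
;; started until it finishes, so the slow task dominates the wall-clock time.
(time (dorun (pmap #(slow-inc (if (zero? %) 3000 100) %) (range 32))))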

Stefan

Mikera

Oct 17, 2013, 7:04:51 AM
to clo...@googlegroups.com
I don't think there is a single idiomatic way. It depends on lots of things, e.g.:
- How expensive is each side-effect-fn? If it is cheap, then the overhead of making things parallel may not be worth it
- Do you want to constrain the thread pool, or have a separate thread for each element? For the latter, futures are an option
- Where is the actual bottleneck? If an external resource is constrained, CPU parallelization may not help you at all
- How is the lazy sequence being produced? Is it already realised, or being computed on the fly?
- Is there any concern about ordering / concurrent access to resources / race conditions?

Assuming that side-effect-fn is relatively CPU-expensive and that the runtimes of each call to it are reasonably similar, I'd say that your (dorun (pmap .....)) version is a decent choice. Otherwise you may want to take a look at the "reducers" library - the Fork/Join capabilities are very impressive and should do what you need.
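
As a sketch of the "separate thread for each element" option, using the names from the original post (only sensible when the seq has a modest number of elements, since every future gets its own thread up front):

;; Launch one future per element, then deref each so we both wait for
;; completion and surface any exceptions thrown by side-effect-fn.
(let [tasks (doall (map #(future (side-effect-fn %)) some-lazy-seq))]
  (doseq [t tasks] @t))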

Brian Craft

Oct 17, 2013, 6:12:31 PM
to clo...@googlegroups.com
I have the same use case: walking a seq of an input file, and doing file/db operations for each row. pmap is working very well, but it has required a lot of attention to the data flow, to make sure that no significant compute is done in the main thread. Otherwise IO blocks the compute.
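
In sketch form, with hypothetical parse-row (the CPU-bound transform) and write-row! (the db write) passed in, that shape looks roughly like:

(require '[clojure.java.io :as io])

;; pmap does the parsing on worker threads; the main thread does nothing
;; but consume results in order and perform the db writes.
(defn load-rows! [parse-row write-row! path]
  (with-open [rdr (io/reader path)]
    (doseq [row (pmap parse-row (line-seq rdr))]
      (write-row! row))))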

I briefly tried working with the reducers library, which generally made things 2-3 times slower, presumably because I'm using it incorrectly. I would really like to see more reducers examples, e.g. for this case: reading a seq larger than memory, doing transforms on the data, and then executing side effects.

Maximilien Rzepka

Oct 17, 2013, 10:45:45 PM
to clo...@googlegroups.com
Not an expert, but a tip from Michal Marczyk (much more of an expert than me ;)): prefer doseq over dorun, because doseq is chunk-aware.

Stefan Kamphausen

Oct 18, 2013, 3:05:50 AM
to clo...@googlegroups.com
Hi,


On Friday, October 18, 2013 12:12:31 AM UTC+2, Brian Craft wrote:
I briefly tried working with the reducers library, which generally made things 2-3 times slower, presumably because I'm using it incorrectly. I would really like to see more reducers examples, e.g. for this case: reading a seq larger than memory, doing transforms on the data, and then executing side effects.

I used reducers for processing lots of XML files. Probably the most common pitfall is that fold only does its computation in parallel when working on a vector. While all the XML data would not have fit into memory, the vector of filenames to read from certainly did, and that made a big difference. Plus, I reduced the partition size from the default of 512 down to 1.
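
Roughly that shape, with a hypothetical process-xml-file that reads one file and returns a map of counts (partial results then merge with merge-with +):

(require '[clojure.core.reducers :as r])

;; Fold over a *vector* of filenames so fold can actually fork, with a
;; partition size of 1 so each file becomes its own fork/join task.
(defn process-all [process-xml-file filenames]
  (r/fold 1
          (r/monoid (partial merge-with +) (constantly {}))
          (fn [acc f] (merge-with + acc (process-xml-file f)))
          (vec filenames)))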


Cheers,
Stefan

Pradeep Gollakota

Oct 18, 2013, 3:23:51 PM
to clo...@googlegroups.com
Hi All,

Thank you so much for your replies!

For my particular use case ("tail -f" multiple files and write the entries into a db), I'm using pmap to process each file in a separate thread and for each file, I'm using doseq to write to db. It seems to be working well (though I still need to benchmark it).
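
In sketch form (tail-seq, a hypothetical function returning a lazy seq of new lines in a file, and write-entry! for the db insert, are stand-ins):

;; One pmap worker per file, each one doseq-ing over that file's entries
;; and writing them to the db. Note Stefan's point above: pmap only keeps
;; roughly (+ 2 n-cores) tasks in flight at once.
(defn tail-files->db [tail-seq write-entry! files]
  (dorun
   (pmap (fn [file]
           (doseq [entry (tail-seq file)]
             (write-entry! entry)))
         files)))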

Thanks to your help, I have a better understanding of how doseq, dorun, et al. work.