fold over a sequence

304 views
Skip to first unread message

Paul Butcher

unread,
Mar 10, 2013, 8:40:54 PM3/10/13
to clo...@googlegroups.com
As things currently stand, fold can be used on a sequence-based reducible collection, but won't be parallel.

I'm currently working on code that processes XML generated by clojure.data.xml/parse, and would love to do so in parallel. I can't immediately see any reason why it wouldn't be possible to create a version of CollFold that takes a sequence and "chunks" it so that it can be folded in parallel. Has anyone tried this yet? Is there some gotcha lurking to catch me out?

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Rich Morin

unread,
Mar 10, 2013, 9:32:00 PM3/10/13
to clo...@googlegroups.com
On Mar 10, 2013, at 17:40, Paul Butcher wrote:
> As things currently stand, fold can be used on a sequence-
> based reducible collection, but won't be parallel.
>
> I'm currently working on code that processes XML generated by clojure.data.xml/parse, and would love to do so in parallel.
> I can't immediately see any reason why it wouldn't be possible
> to create a version of CollFold that takes a sequence and
> "chunks" it so that it can be folded in parallel. Has anyone
> tried this yet? Is there some gotcha lurking to catch me out?

I love the idea; dunno if there are any gremlins lurking. BTW,
if you haven't already done so, be sure to watch Guy Steele's talk:

How to Think about Parallel Programming: Not!
http://www.infoq.com/presentations/Thinking-Parallel-Programming

The first third of the talk (~20 minutes) is a wonderful discussion
of extreme hacking (eg, a single-card dump program for an IBM 1130).
Guy then segues into a discussion of giving up control of details in
order to let "the system" take care of things (eg, memory use).

The last part of the talk is about the conceptual basis for order-
independent folding of calculations. So, required viewing (IMHO) if
you're thinking about extending reducers, etc.

-r

--
http://www.cfcl.com/rdm Rich Morin
http://www.cfcl.com/rdm/resume r...@cfcl.com
http://www.cfcl.com/rdm/weblog +1 650-873-7841

Software system design, development, and documentation


Alan Busby

unread,
Mar 11, 2013, 12:56:38 AM3/11/13
to clo...@googlegroups.com
On Mon, Mar 11, 2013 at 9:40 AM, Paul Butcher <pa...@paulbutcher.com> wrote:
I'm currently working on code that processes XML generated by clojure.data.xml/parse, and would love to do so in parallel. I can't immediately see any reason why it wouldn't be possible to create a version of CollFold that takes a sequence and "chunks" it so that it can be folded in parallel. Has anyone tried this yet? Is there some gotcha lurking to catch me out?

I've done something like the above here;

I haven't run in to any gotcha's yet after 6+ months of extensive use.
 
I did have to copy/paste the definitions for pool, fjtask, fjinvoke, fjfork, fjjoin from core/reducers.clj and keep them from being AOT compiled (for Java 1.7 and 1.6 compatibility).

I'm not pleased with the way the thread pool is inaccessible in core/reducers.clj because I'm not aware of any mechanism to have fold *NOT* use every available resource.


Hope this helps,
    Alan

Jim foo.bar

unread,
Mar 11, 2013, 6:40:01 AM3/11/13
to clo...@googlegroups.com
I don't think you will be able to do a parallel fold on a lazy-seq which is what clojure.data.xml/parse returns. Vectors are the only persistent collection that supports parallel fold and something tells me it's because they are NOT lazy...

why can't you 'vec' the result of xml/parse and then use fold on that? Is it a massive seq?

Jim
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Marko Topolnik

unread,
Mar 11, 2013, 7:00:45 AM3/11/13
to clo...@googlegroups.com
The idea is to transform into a lazy sequence of eager chunks. That approach should work.

Paul Butcher

unread,
Mar 11, 2013, 9:38:18 AM3/11/13
to clo...@googlegroups.com
On 11 Mar 2013, at 10:40, Jim foo.bar <jimpi...@gmail.com> wrote:

why can't you 'vec' the result of xml/parse and then use fold on that? Is it a massive seq?

In my case, it's the Wikipedia XML dump, so around 40GiB (so no, that wouldn't work :-)

Paul Butcher

unread,
Mar 11, 2013, 9:38:51 AM3/11/13
to clo...@googlegroups.com
On 11 Mar 2013, at 11:00, Marko Topolnik <marko.t...@gmail.com> wrote:

The idea is to transform into a lazy sequence of eager chunks. That approach should work.

Exactly. Right - I guess I should put my money where my mouth is and see if I can get it working...

Paul Butcher

unread,
Mar 12, 2013, 9:34:43 AM3/12/13
to clo...@googlegroups.com
So this turned out to be pretty easy. I've implemented a function called "foldable-seq" that takes a lazy sequence and turns it into something that can be folded in parallel. I've checked an example program that uses it to count words in a Wikipedia XML dump into GitHub:


The code for foldable-seq is here:


On my 4-core MacBook Pro, I see a 40 second runtime without parallel-seq, 13 seconds with.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Marko Topolnik

unread,
Mar 12, 2013, 9:45:16 AM3/12/13
to clo...@googlegroups.com
Nice going :) Is it really impossible to somehow do this from the outside, through the public API?

Paul Butcher

unread,
Mar 12, 2013, 9:48:52 AM3/12/13
to clo...@googlegroups.com
On 12 Mar 2013, at 13:45, Marko Topolnik <marko.t...@gmail.com> wrote:

Nice going :) Is it really impossible to somehow do this from the outside, through the public API?

I think that it *does* do it from the outside through the public API :-) I'm just reifying the (public) CollFold protocol.

I do copy a bunch of helper functions related to fork/join which are private within the reducers namespace (which isn't particularly nice). But I guess that this will go away when (if?) Clojure has a proper API to support fork/join. But I thought it better to do that than reinvent the wheel.

Or am I missing something?

Adam Clements

unread,
Mar 12, 2013, 9:49:32 AM3/12/13
to clo...@googlegroups.com
I've had exactly this problem trying to use reducers over a large file that wouldn't fit in memory.

I tried iota, but had the issue that it was still scanning and memory mapping the entire file before it would start doing anything (pulling the whole thing through ram and taking a fair few minutes).

How would feeding a line-seq into this compare to iota? And how would that compare to a version of iota tweaked to work in a slightly less eager fashion?

Adam

Marko Topolnik

unread,
Mar 12, 2013, 9:52:37 AM3/12/13
to clo...@googlegroups.com
On Tuesday, March 12, 2013 2:48:52 PM UTC+1, Paul Butcher wrote:
On 12 Mar 2013, at 13:45, Marko Topolnik <marko.t...@gmail.com> wrote:

Nice going :) Is it really impossible to somehow do this from the outside, through the public API?

I think that it *does* do it from the outside through the public API :-) I'm just reifying the (public) CollFold protocol.

I do copy a bunch of helper functions related to fork/join which are private within the reducers namespace (which isn't particularly nice). But I guess that this will go away when (if?) Clojure has a proper API to support fork/join. But I thought it better to do that than reinvent the wheel.

That's what I meant, succeed by relying on the way f/j is used by the reducers public API, without copy-pasting the internals and using them directly. So I guess the answer is "no".

-Marko
 

Paul Butcher

unread,
Mar 12, 2013, 9:55:56 AM3/12/13
to clo...@googlegroups.com
On 12 Mar 2013, at 13:52, Marko Topolnik <marko.t...@gmail.com> wrote:

That's what I meant, succeed by relying on the way f/j is used by the reducers public API, without copy-pasting the internals and using them directly. So I guess the answer is "no".

I don't believe that I could - the CollFold implementation effectively does a binary chop on the sequence which would force the whole sequence to be realised if I did the same. It's a sensible strategy for a fully realised data structure that already exists in memory, but not for a lazy sequence.

Paul Butcher

unread,
Mar 12, 2013, 10:00:19 AM3/12/13
to clo...@googlegroups.com
On 12 Mar 2013, at 13:49, Adam Clements <adam.c...@gmail.com> wrote:

How would feeding a line-seq into this compare to iota? And how would that compare to a version of iota tweaked to work in a slightly less eager fashion?

It'll not suffer from the problem of having to drag the whole file into memory, but will incur the overhead of turning everything into JVM data structures that iota avoids. I don't imagine that it would be hard to modify iota to use a similar approach though. Although I imagine that Alan's better placed to have an opinion on that :-)

Alan Busby

unread,
Mar 12, 2013, 11:55:17 AM3/12/13
to clo...@googlegroups.com
On Tue, Mar 12, 2013 at 11:00 PM, Paul Butcher <pa...@paulbutcher.com> wrote:
On 12 Mar 2013, at 13:49, Adam Clements <adam.c...@gmail.com> wrote:

How would feeding a line-seq into this compare to iota? And how would that compare to a version of iota tweaked to work in a slightly less eager fashion?

It'll not suffer from the problem of having to drag the whole file into memory, but will incur the overhead of turning everything into JVM data structures that iota avoids. I don't imagine that it would be hard to modify iota to use a similar approach though. Although I imagine that Alan's better placed to have an opinion on that :-)

If you're dealing with a stream that you'll only read once, then Paul's foldseq+line-seq should work much more effectively than Iota I'd think.
Iota generates an index of line numbers so (nth iota-vec 100) is O(1), which is why it'll read through the entire file on loading. If you only need to access each line once, then that initial indexing step is likely a waste.

If Paul wouldn't mind I'd like to add a a similar "seq" function to Iota that would allow for index-less processing like he did in foldable-seq.

Paul Butcher

unread,
Mar 12, 2013, 11:59:45 AM3/12/13
to clo...@googlegroups.com
On 12 Mar 2013, at 15:55, Alan Busby <theb...@thebusby.com> wrote:

If Paul wouldn't mind I'd like to add a a similar "seq" function to Iota that would allow for index-less processing like he did in foldable-seq.

Paul would be delighted :-)

Stuart Sierra

unread,
Mar 12, 2013, 2:26:34 PM3/12/13
to clo...@googlegroups.com
Hi Paul,

This might be an interesting contribution to clojure.core.reducers. I haven't looked at your code in detail, so I can't say for sure, but being able to do parallel fold over semi-lazy sequences would be very useful.

-S

Paul Butcher

unread,
Mar 12, 2013, 3:34:30 PM3/12/13
to clo...@googlegroups.com
On 12 Mar 2013, at 18:26, Stuart Sierra <the.stua...@gmail.com> wrote:

This might be an interesting contribution to clojure.core.reducers. I haven't looked at your code in detail, so I can't say for sure, but being able to do parallel fold over semi-lazy sequences would be very useful.

I'd be delighted if this (or something like it) could make it into clojure.core.reducers, Stuart. What would I need to do to make this happen?

Stuart Sierra

unread,
Mar 12, 2013, 5:48:44 PM3/12/13
to clo...@googlegroups.com
See clojure.org/contributing 

it's all there.
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+u...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to a topic in the Google Groups "Clojure" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/clojure/8RKCjF00ukQ/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to clojure+u...@googlegroups.com.

Paul Butcher

unread,
Mar 13, 2013, 10:03:37 AM3/13/13
to clo...@googlegroups.com
Thanks Stuart - my Contributor Agreement is on its way.

In the meantime, I've published foldable-seq as a library:


I'd be very interested in any feedback on the code or how it works. 

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

You received this message because you are subscribed to the Google Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.

Jim foo.bar

unread,
Mar 13, 2013, 10:05:03 AM3/13/13
to clo...@googlegroups.com
how come your project depends on the problematic version 1.5.0?

Jim

Paul Butcher

unread,
Mar 13, 2013, 10:12:19 AM3/13/13
to clo...@googlegroups.com
On 13 Mar 2013, at 14:05, "Jim foo.bar" <jimpi...@gmail.com> wrote:

how come your project depends on the problematic version 1.5.0?

1.5.0 is problematic?

Jim foo.bar

unread,
Mar 13, 2013, 10:13:35 AM3/13/13
to clo...@googlegroups.com
there was a memory leak hence the 1.5.1 release the next day...

Jim

Paul Butcher

unread,
Mar 13, 2013, 10:21:59 AM3/13/13
to clo...@googlegroups.com
Ah - sorry, missed that. I've just released version 0.2, which depends on 1.5.1.

--
paul.butcher->msgCount++

Snetterton, Castle Combe, Cadwell Park...
Who says I have a one track mind?

http://www.paulbutcher.com/
LinkedIn: http://www.linkedin.com/in/paulbutcher
MSN: pa...@paulbutcher.com
AIM: paulrabutcher
Skype: paulrabutcher

Reply all
Reply to author
Forward
0 new messages