What if there's hardware support for nested data parallelism?

Antti Halme

unread,

Nov 11, 2012, 12:06:25 AM11/11/12

to parallel...@googlegroups.com

Hi,

I'm a master's student at Aalto University, Finland, and a beginner in Haskell. I'm looking at parallel computing with GPUs, and in particular, using functional programming techniques to do so. I'm playing with Accelerate and going through Marlow's recent material, so I guess I'm set in some sense. I had a quick look at the current research effort as well, and noticed something I found a little strange.

Earlier this year NVIDIA introduced the Kepler GPU architecture [1] with support for "dynamic parallelism" where kernels can be launched without returning to host code, effectively enabling nested computation on the GPU. This kind of dynamic scheduling approach seems to resemble Cilk and other Intel MIC stuff, but with lower level support. It's not just work generation. It seems like the scheduling does not mess up locality and, judging by their examples, the approach in fact improves the parallel execution granularity.

A nice talk by Stephen Jones [2] explains what the dynamic parallelism thing can do. The Mandelbrot set region-of-interest focus is particularly cool. There's also the O(nlogn) tree version of Barnes-Hut n-body. A related NVIDIA project called Copperhead is about developing a Python dialect stripped pure for declarative GPU programming, with plans to support the nested model. These got me thinking that surely the FP guys are all over this by now? I mean, to me, writing nicely composing Haskell for hardware that supports true NDP sounds very pleasant.

I looked at the Haskell research efforts in parallelism, absolutely certain to find this kind of dynamic scheduling programming model overcoming the Blelloch-style flattening stuff Peyton Jones strongly advocates [3], but as far as I could tell, ICFP and Haskell symposium (Sep 2012) had half a dozen papers on vectorization and zero on dynamic scheduling. Even "Nested Data-Parallelism on the GPU" by Bergstrom and Reppy talks only about flattening. What gives?

Now, I understand that no-one has access to Kepler GK110 -cards yet, but surely someone has at least challenged vectorization with this kind of dynamic scheduling before? Or am I missing something fundamental here? Why is flattening the best way to do nested data parallelism in functional programming? Are those arguments still valid if there's proper HW support for nested computation?

Thanks,
Antti

[1] - White paper
http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf

[2] - Video, slides
http://nvidia.fullviewmedia.com/gtc2012/0517-J3-S0338.html
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0338-GTC2012-CUDA-Programming-Model.pdf

[3] - SPJ
"The Future is Parallel, and the Future of Parallel is Declarative"
http://www.youtube.com/watch?v=hlyQjK1qjw8
"Data Parallel Haskell"
http://www.youtube.com/watch?v=NWSZ4c9yqW8

Carter Schonwald

unread,

Nov 11, 2012, 12:32:49 AM11/11/12

to antti...@aalto.fi, parallel...@googlegroups.com

Hey Antti,

While I can't answer the question of which techniques etc, there actually has been some interesting recent work in the FP space done wrt the dynamic parallelism vein of work!

the manticore project/language

http://manticore.cs.uchicago.edu/

seems to be a bunch of work in that space.

I'm not sure if those ideas map nicely to haskell or not (or at least I've not thought about it), but I believe thats the sort of techniques you're wondering about?

hope that helps (partially) answer your question!

-Carter

Ben Lippmeier

unread,

Nov 11, 2012, 4:22:53 AM11/11/12

to antti...@aalto.fi, parallel...@googlegroups.com

On 11/11/2012, at 16:06 , Antti Halme wrote:

> I looked at the Haskell research efforts in parallelism, absolutely certain to find this kind of dynamic scheduling programming model overcoming the Blelloch-style flattening stuff Peyton Jones strongly advocates [3], but as far as I could tell, ICFP and Haskell symposium (Sep 2012) had half a dozen papers on vectorization and zero on dynamic scheduling. Even "Nested Data-Parallelism on the GPU" by Bergstrom and Reppy talks only about flattening. What gives?

It's the wheel of reinvention. Funnily enough, people find SIMD style vector programming too constraining and are going back to a thread based model -- like in the multicore "revolution". The idea of parallel threads launching more parallel threads isn't a new one. The typical challenges with a thread-based model include: 1) granularity 2) communication.

DPH is drifting away from the vector model anyway. It turns out that the machines we're trying to target aren't vector machines, they're shared memory multicore machines. Back in the NESL heyday Cray's were cool, the memory bus could keep up with the processor, and streaming out an intermediate vector to memory was less of a problem. NESL didn't fuse array operators, and was still useful. With DPH we need aggressive array fusion to keep intermediate data local to each processor, and various parts of it now depend on shared memory. I'm also moving to a new fusion model which will allow threads to send messages to each other while executing a fused array operation. We're not yet dynamically forking new threads, but it wouldn't shock me if we find some reason to do so next year.

The machines we have these days are a hybrid of several different approaches, and the programming systems that target them are going to be hybrids as well.

Ben.

Bodo

unread,

Nov 13, 2012, 11:55:17 AM11/13/12

to parallel...@googlegroups.com

On Sunday, November 11, 2012 5:06:26 AM UTC, Antti Halme wrote:

Hi,

I'm a master's student at Aalto University, Finland, and a beginner in Haskell. I'm looking at parallel computing with GPUs, and in particular, using functional programming techniques to do so. I'm playing with Accelerate and going through Marlow's recent material, so I guess I'm set in some sense. I had a quick look at the current research effort as well, and noticed something I found a little strange.

Earlier this year NVIDIA introduced the Kepler GPU architecture [1] with support for "dynamic parallelism" where kernels can be launched without returning to host code, effectively enabling nested computation on the GPU. This kind of dynamic scheduling approach seems to resemble Cilk and other Intel MIC stuff, but with lower level support. It's not just work generation. It seems like the scheduling does not mess up locality and, judging by their examples, the approach in fact improves the parallel execution granularity.

A nice talk by Stephen Jones [2] explains what the dynamic parallelism thing can do. The Mandelbrot set region-of-interest focus is particularly cool. There's also the O(nlogn) tree version of Barnes-Hut n-body. A related NVIDIA project called Copperhead is about developing a Python dialect stripped pure for declarative GPU programming, with plans to support the nested model. These got me thinking that surely the FP guys are all over this by now? I mean, to me, writing nicely composing Haskell for hardware that supports true NDP sounds very pleasant.

I looked at the Haskell research efforts in parallelism, absolutely certain to find this kind of dynamic scheduling programming model overcoming the Blelloch-style flattening stuff Peyton Jones strongly advocates [3], but as far as I could tell, ICFP and Haskell symposium (Sep 2012) had half a dozen papers on vectorization and zero on dynamic scheduling. Even "Nested Data-Parallelism on the GPU" by Bergstrom and Reppy talks only about flattening. What gives?

Now, I understand that no-one has access to Kepler GK110 -cards yet, but surely someone has at least challenged vectorization with this kind of dynamic scheduling before? Or am I missing something fundamental here? Why is flattening the best way to do nested data parallelism in functional programming? Are those arguments still valid if there's proper HW support for nested computation?

I agree, this sounds like software heaven :-) However, reality seems to be a bit more complex...

In the context of the FP7 project Apple-CORE (www.apple-core.info) we have run experiments with an architecture that allows for massively nested thread creation. We have compiled SaC (www.sac-home.org), an applicative purely functional array programming language onto that architecture. WIth that architecture at hand, we were keen to map nested DP directly rather than applying flattening ala Blelloch.

In essence, we can observe several challenges:

1) one of the aims of flattening is easier scheduling/ work load balencing. If you expose the nesting to the HW, the HW has to cope with these issues, ie. it has to be able to do proper thread scheduling with loads of different threads around.

2) if you expose all nested DP to the HW the leaf-threads can become too fine grained to be beneficial and you can exhaust the maximum number of theads the system can handle;

3) if dynamic memory management gets involved, massive fine-grained parallelism requires advanced memory techniques;

One may argue, that all these issues, in a way, are well-known from the early work on task-parallelism in the FP community. However, when starting from nested DP where individual spawns are typically in the hundreds at least, it appears that these can be manageable.

If you are interested about the solutions we have developed for dealing with these issues in the Apple-CORE project, you may want to look at the papers referenced below. If details are missing I am happy to engage in further discussions.

[1] S. Herhut, C. Joslin, S. Scholz, and C. Grelck, Truly Nested Data-Parallelism. Compiling SAC to the Microgrid Architecture, Seton Hall University, South Orange, NJ, USA., In: IFL ’09: Draft Proceedings of the 21st Symposium on Implementation and Application of Functional Languages. SHU-TR-CS-2009-09-1, , 2009 [see attachment]

[2] S. Herhut, C. Joslin, S. B. Scholz, R. Poss, and C. Grelck, Concurrent Non-Deferred Reference Counting on the Microgrid: First Experiences, in 22nd International Symposium on Implementation and Application of Functional Languages (IFL’10), Alphen a/d Rijn, Netherlands, Revised Selected Papers, 2011, pp. 185-202

Best wishes,

Sven-Bodo Scholz

paper.pdf

Reply all

Reply to author

Forward