On Sunday, November 11, 2012 5:06:26 AM UTC, Antti Halme wrote:
Hi,
I'm a master's student at Aalto University, Finland, and a beginner in Haskell. I'm looking at parallel computing with GPUs, and in particular, using functional programming techniques to do so. I'm playing with Accelerate and going through Marlow's recent material, so I guess I'm set in some sense. I had a quick look at the current research effort as well, and noticed something I found a little strange.
Earlier this year NVIDIA introduced the Kepler GPU architecture [1] with support for "dynamic parallelism" where kernels can be launched without returning to host code, effectively enabling nested computation on the GPU. This kind of dynamic scheduling approach seems to resemble Cilk and other Intel MIC stuff, but with lower level support. It's not just work generation. It seems like the scheduling does not mess up locality and, judging by their examples, the approach in fact improves the parallel execution granularity.
A nice talk by Stephen Jones [2] explains what the dynamic parallelism thing can do. The Mandelbrot set region-of-interest focus is particularly cool. There's also the O(nlogn) tree version of Barnes-Hut n-body. A related NVIDIA project called Copperhead is about developing a Python dialect stripped pure for declarative GPU programming, with plans to support the nested model. These got me thinking that surely the FP guys are all over this by now? I mean, to me, writing nicely composing Haskell for hardware that supports true NDP sounds very pleasant.
I looked at the Haskell research efforts in parallelism, absolutely certain to find this kind of dynamic scheduling programming model overcoming the Blelloch-style flattening stuff Peyton Jones strongly advocates [3], but as far as I could tell, ICFP and Haskell symposium (Sep 2012) had half a dozen papers on vectorization and zero on dynamic scheduling. Even "Nested Data-Parallelism on the GPU" by Bergstrom and Reppy talks only about flattening. What gives?
Now, I understand that no-one has access to Kepler GK110 -cards yet, but surely someone has at least challenged vectorization with this kind of dynamic scheduling before? Or am I missing something fundamental here? Why is flattening the best way to do nested data parallelism in functional programming? Are those arguments still valid if there's proper HW support for nested computation?
I agree, this sounds like software heaven :-) However, reality seems to be a bit more complex...
In the context of the FP7 project Apple-CORE (
www.apple-core.info) we have run experiments with an architecture that allows for massively nested thread creation. We have compiled SaC (
www.sac-home.org), an applicative purely functional array programming language onto that architecture. WIth that architecture at hand, we were keen to map nested DP directly rather than applying flattening ala Blelloch.
In essence, we can observe several challenges:
1) one of the aims of flattening is easier scheduling/ work load balencing. If you expose the nesting to the HW, the HW has to cope with these issues, ie. it has to be able to do proper thread scheduling with loads of different threads around.
2) if you expose all nested DP to the HW the leaf-threads can become too fine grained to be beneficial and you can exhaust the maximum number of theads the system can handle;
3) if dynamic memory management gets involved, massive fine-grained parallelism requires advanced memory techniques;
One may argue, that all these issues, in a way, are well-known from the early work on task-parallelism in the FP community. However, when starting from nested DP where individual spawns are typically in the hundreds at least, it appears that these can be manageable.
If you are interested about the solutions we have developed for dealing with these issues in the Apple-CORE project, you may want to look at the papers referenced below. If details are missing I am happy to engage in further discussions.
[1] S. Herhut, C. Joslin, S. Scholz, and C. Grelck, Truly Nested Data-Parallelism. Compiling SAC to the Microgrid Architecture, Seton Hall University, South Orange, NJ, USA., In: IFL ’09: Draft Proceedings of the 21st Symposium on Implementation and Application of Functional Languages. SHU-TR-CS-2009-09-1, , 2009 [see attachment]
[2] S. Herhut, C. Joslin, S. B. Scholz, R. Poss, and C. Grelck, Concurrent Non-Deferred Reference Counting on the Microgrid: First Experiences, in 22nd International Symposium on Implementation and Application of Functional Languages (IFL’10), Alphen a/d Rijn, Netherlands, Revised Selected Papers, 2011, pp. 185-202
Best wishes,
Sven-Bodo Scholz