ANN: ParallelAccelerator v0.2 for Julia 0.5 released.


Todd Anderson

Oct 25, 2016, 12:42:44 PM
to julia-users
The High Performance Scripting team at Intel Labs is pleased to announce the release of version 0.2 of ParallelAccelerator.jl, a package for high-performance parallel computing in Julia, primarily oriented around arrays and stencils.  In this release, we provide support for Julia 0.5 and introduce experimental support for the Julia native threading backend.  While we still support Julia 0.4, that support should be considered deprecated; we recommend everyone move to Julia 0.5, as Julia 0.4 support may be removed in the future.

The goal of ParallelAccelerator is to accelerate the computational kernel of an application: the programmer simply annotates the kernel function with the @acc (short for "accelerate") macro provided by the ParallelAccelerator package.  In version 0.2, ParallelAccelerator still defaults to transforming the kernel into OpenMP C code, compiling it with a system C compiler (ICC or GCC), and transparently invoking the C code from Julia as if the program were running normally.
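
As a minimal sketch of what this looks like in practice (illustrative code, not taken from the package documentation):

using ParallelAccelerator

# @acc asks ParallelAccelerator to compile this function's array
# operations to parallel OpenMP C code.
@acc function sum_of_squares(x)
    return sum(x .* x)
end

println(sum_of_squares(rand(10^6)))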

However, ParallelAccelerator v0.2 also introduces experimental backend support for Julia's native threading (which is itself experimental).  To enable native threading mode, set the environment variable PROSPECT_MODE=threads.  In this mode, ParallelAccelerator identifies pieces of code that can run in parallel, runs them as if they had been annotated with Julia's @threads, and goes through the standard Julia compiler pipeline with LLVM.  The ParallelAccelerator C backend has the limitation that kernel functions, and anything they call, cannot contain code that is not type-stable to a single type; in particular, variables of type Any are not supported.  In practice, this restriction was a significant limitation.  The native threading backend needs no such restriction and thus should handle arbitrary Julia code.
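
For example, to launch a program under this mode with four Julia threads (JULIA_NUM_THREADS is Julia's standard thread-count variable; the file name is hypothetical):

PROSPECT_MODE=threads JULIA_NUM_THREADS=4 julia myprogram.jl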

Under the hood, ParallelAccelerator is essentially a domain-specific compiler written in Julia. It performs additional analysis and optimization on top of the Julia compiler. ParallelAccelerator discovers and exploits the implicit parallelism in source programs that use parallel programming patterns such as map, reduce, comprehension, and stencil. For example, Julia array operators such as .+, .-, .*, ./ are translated by ParallelAccelerator internally into data-parallel map operations over all elements of input arrays. For the most part, these patterns are already present in standard Julia, so programmers can use ParallelAccelerator to run the same Julia program without (significantly) modifying the source code.
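
For instance, a sketch combining a few of these patterns in one kernel (illustrative code, not from the package documentation):

using ParallelAccelerator

@acc function patterns(a, b)
    c = a .+ 2.0 .* b               # element-wise operators: a parallel map
    return sum([x * x for x in c])  # comprehension feeding a parallel reduce
end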

Version 0.2 should be considered an alpha release, suitable for early adopters and Julia enthusiasts.  Please file bugs at https://github.com/IntelLabs/ParallelAccelerator.jl/issues .

See our GitHub repository at https://github.com/IntelLabs/ParallelAccelerator.jl for a complete list of prerequisites, supported platforms, example programs, and documentation.

Thanks to our colleagues at Intel and Intel Labs, the Julia team, and the broader Julia community for their support of our efforts!

Best regards,
The High Performance Scripting team
(Parallel Computing Lab, Intel Labs)

Todd Anderson

Oct 25, 2016, 4:10:36 PM
to julia-users
Actually, it seems that the pull request into Julia METADATA to set the default ParallelAccelerator version has not yet been merged.  So, if you want the simplest package update, or to get the correct version with a simple Pkg.add, hold off until the merge happens.  I'll post here again when it does.

thanks,

Todd

Todd Anderson

Oct 26, 2016, 8:13:38 PM
to julia-users
Okay, METADATA with ParallelAccelerator version 0.2 has been merged, so if you do a standard Pkg.add() or Pkg.update() you should get the latest version.
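
That is, either of the usual commands should now pick it up:

Pkg.add("ParallelAccelerator")   # fresh install
Pkg.update()                     # or update an existing installation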

For native threads, please note that we've identified some issues with reductions and stencils; these have been fixed and will be released shortly in version 0.2.1.  I will post here again when that release takes place.

Again, please give it a try and report back with your experiences or file bugs.

thanks!

Todd

Ralph Smith

Oct 26, 2016, 10:51:33 PM
to julia-users
This is great stuff.  Initial observations (under Linux/GCC) are that native threads are about 20% faster than OpenMP, so I surmise you are feeding LLVM some very tasty code.  (I tested long loops with straightforward memory access.)

On the other hand, some of the earlier posts make me think that you were leveraging the strong vector optimization of the Intel C compiler and its tight coupling to the MKL libraries.  If so, is there any prospect of getting LLVM to take advantage of MKL?

Jeffrey Sarnoff

Oct 27, 2016, 3:20:05 AM
to julia-users
With appreciation for Intel Labs' commitment, our thanks to the people who landed v0.2 of the ParallelAccelerator project.



Todd Anderson

Oct 27, 2016, 1:04:33 PM
to julia-users
That's interesting.  I generally don't test with GCC, and my experiments with the ICC/C backend have shown something like 20% slower for LLVM/native threads on some classes of benchmarks (like blackscholes) but 2-4x slower on others (like laplace-3d).  The 20% may be attributable to ICC being better (including at vectorization, as you mention), but certainly not the 2-4x.  These larger differences are still under investigation.

I guess something we have said in the docs or our postings has created the impression that our performance gains are somehow related to MKL or BLAS in general.  If you have MKL, you can compile Julia to use it through its LLVM path.  ParallelAccelerator does not insert calls to MKL where they didn't exist in the incoming IR, and I don't think ICC does either.  If MKL calls exist in the incoming IR, we don't modify them either.

Chris Rackauckas

Oct 27, 2016, 1:47:57 PM
to julia-users
Thank you for all of your amazing work. I will be giving v0.2 a try soon. But I have two questions:

1) How do you see ParallelAccelerator integrating with packages? I asked this in the chatroom, but I think having it here might be helpful for others to chime in. If I want to use ParallelAccelerator in a package, then it seems like I would have to require it (and make sure that every user I have can compile it!) and sprinkle the macros around. Is there some sensible way to be able to use ParallelAccelerator if it's available on the user's machine, but not otherwise? This might be something that requires Pkg3, but even with Pkg3 I don't see how to do this without having one version of the function with a macro, and another without it.

2) What do you see as the future of ParallelAccelerator going forward? It seems like Base Julia is stepping all over your domain: automated loop fusing, multithreading, etc. What exactly does ParallelAccelerator give that Base Julia does not or, in the near future, will not / can not? I am curious because with Base Julia getting so many optimizations itself, it's hard to tell whether supporting ParallelAccelerator will be a worthwhile investment in a year or two, and wanted to know what you guys think of that. I don't mean you haven't done great work: you clearly have, but it seems Julia is also doing a lot of great work!

Viral Shah

Oct 27, 2016, 2:54:25 PM
to julia-users
Not speaking on behalf of the ParallelAccelerator team, but the long-term future of ParallelAccelerator, in my opinion, is to do exactly that: keep pushing on new things and get them (code or ideas) merged into Base as they stabilize.  Without the ParallelAccelerator team pushing us, multi-threading would have moved much slower.

-viral

Todd Anderson

Oct 27, 2016, 5:02:38 PM
to julia-users
To answer your question #1, would the following be suitable?  There may be a couple of details to work out, but what about the general approach?

if haskey(Pkg.installed(), "ParallelAccelerator")
    println("ParallelAccelerator present")
    using ParallelAccelerator

    macro PkgCheck(ast)
        quote
            @acc $(esc(ast))
        end
    end
else
    println("ParallelAccelerator not present")

    macro PkgCheck(ast)
        return esc(ast)   # esc so the wrapped definition isn't renamed by macro hygiene
    end
end

@PkgCheck function f1(x)
    x * 5
end

a = f1(10)
println("a = ", a)


2) The point of ParallelAccelerator is to extract implicit parallelism automatically, whereas the purpose of @threads is to let you express parallelism explicitly.  Both enable parallelism, but the former has the potential to be much easier to use, particularly for scientific programmers who are more scientist than programmer.  In general, I feel there is room for all approaches to be supported across a range of programming ability.
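
To illustrate the contrast (a sketch; the kernel here is a made-up example):

using Base.Threads
using ParallelAccelerator

# Explicit parallelism: the programmer marks the loop with @threads.
function axpy_explicit!(y, a, x)
    @threads for i in eachindex(x)
        y[i] += a * x[i]
    end
    return y
end

# Implicit parallelism: @acc discovers the parallel map in the array expression.
@acc axpy_implicit(y, a, x) = y .+ a .* x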

Todd Anderson

Oct 28, 2016, 6:01:22 PM
to julia-users
Looks like version 0.2.1 has been merged now.

Chris Rackauckas

Oct 28, 2016, 9:14:06 PM
to julia-users
1) Won't that have bad interactions with precompilation?  Since macros apply at parse time, the package will stay in the "state" that it precompiles in: if one precompiles the package and then adds ParallelAccelerator, wouldn't ParallelAccelerator not be used?  And the other way around: if one removes ParallelAccelerator, won't the package be unusable without manually deleting the precompile cache?  I think that to use this, you'd have to tie the package's precompilation to changes in ParallelAccelerator's installed state.

2) Shouldn't/Won't Base auto-parallelize broadcasted calls? That seems like the clear next step after loop fusing is finished and threading is no longer experimental. Where else is the implicit parallelism hiding?

Ralph Smith

Oct 31, 2016, 1:09:07 AM
to julia-users
I looked a bit deeper (i.e. found a machine where I have access to an Intel compiler, albeit not up to date - my shop is cursed by budget cuts).  ICC breaks up a loop like

for (i=0; i<n; i++) {
  a[i] = exp(cos(b[i]));
  s += a[i];
}

into calls to vector math library functions and a separate loop for the sum.  The library is bundled with ICC; it's not MKL, but its domain overlaps with MKL - hence my misapprehension - so your point stands.  Something like blackscholes benefits from these vector library calls, and GCC doesn't do that.

It would be nice if Julia's LLVM system included an optimization pass which invoked a vector math library when appropriate.  I guess that's a challenge outside the scope of ParallelAccelerator, but maybe good ground for some other project.