Some thoughts about packages for fast vectorized computation

Dahua

Oct 9, 2013, 12:36:29 PM
to juli...@googlegroups.com
I created two packages, Devectorize (https://github.com/lindahua/Devectorize.jl) and NumericExtensions (https://github.com/lindahua/NumericExtensions.jl), for faster vectorized computation.

However, I am not completely satisfied with the current status:
  • The more I use them, the more I feel that the current API remains too restrictive -- they are only useful for simple cases, such as element-wise mapping of a single function or running pre-defined reduction operations.
  • The two packages try to address similar issues. While they take different approaches, a considerable portion of their internal code-generation facilities is similar -- a clear indication that some refactoring is needed.
I am considering merging them into a single package. This would have several benefits:
  • People only have to go to one place when they need some tools to accelerate their computation routines. 
  • They can share internal code generation facilities.
  • It may provide a richer API by combining functionalities from both.
In particular, I am considering merging Devectorize into NumericExtensions -- that package provides much more than just code devectorization; it aims to offer a flexible yet consistent solution for fast vectorized computation.

Here are some of the things I have in mind to implement in the package:

1. Some key macros for fast code generation. In particular, I may implement a @map macro, used as
r = @map((x, y) -> sin(x) + cos(y), a, b)
The macro may emit something similar to the following code:
r = # some code to create an array of proper size and type
for i = 1 : length(r)
    @inbounds r[i] = sin(a[i]) + cos(b[i])
end

2. The @map macro may support local variables and multiple outputs, as
(u, v) = @map((x, y) ->(s=sin(x); c=cos(y); (s+c, s-c)), a, b)
This may emit a fused loop as follows:
for i = 1 : length(u)
    @inbounds s = sin(a[i])
    @inbounds c = cos(b[i])
    @inbounds u[i] = s + c
    @inbounds v[i] = s - c
end
Using this approach, one can easily map a relatively complicated kernel.

3. An in-place version, @map!, may also be provided.
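For example (a sketch; the exact argument order of @map! is an assumption, not settled API):
dst = similar(a)
@map!(dst, (x, y) -> sin(x) + cos(y), a, b)  # writes results into dst, no allocation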

4. Macros @mapfunc and @mapfunc! will also be provided, to create standalone functions that can be evaluated multiple times later.
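For example, @mapfunc might be used like this (hypothetical syntax; the name sincos_sum is just for illustration):
@mapfunc sincos_sum (x, y) -> sin(x) + cos(y)
# sincos_sum is now an ordinary named function backed by the generated loop
r1 = sincos_sum(a, b)
r2 = sincos_sum(c, d)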

5. A @fold macro. With this macro, one can write the sum of squared differences as follows:
ssd = @fold(s += abs2(x - y), zero, a, b)
This may emit an optimized routine as follows:
ssd = begin
    R = # infer result type
    n = # infer common array length
    s = zero(R)
    for i = 1 : n
        @inbounds s += abs2(a[i] - b[i])
    end
    s
end

6. @reduce and @mapreduce macros (probably implemented on top of @fold) with simplified syntax, as in:
s = @reduce(+, a)
ssq = @mapreduce(abs2, +, a)
ssd = @mapreduce((x, y) -> abs2(x - y), +, a, b)

7. Macros @foldfunc, @reducefunc, and @mapreducefunc may generate named standalone functions.
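For example (hypothetical syntax, analogous to @mapfunc above):
@mapreducefunc ssdfunc (x, y) -> abs2(x - y) (+)
d = ssdfunc(a, b)  # callable repeatedly without re-expanding the macro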

8. Provide macros @reducedim and @reducedimfunc for reduction along dimensions.
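For example (a sketch; the argument order is an assumption, mirroring Base's sum(a, dim)):
s = @reducedim(+, a, 1)    # sum along the first dimension (column sums)
m = @reducedim(max, a, 2)  # maximum along the second dimension (row maxima)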

9. Specific reduction macros: @sum, @mean, @max, @min, etc. With these, one can write:
s = @sum((x, y) -> abs2(x - y), a, b)
m = @max(abs, a)
r = @mean(abs, a, 1)  # along columns

10. Functors will be deprecated, as they are too restrictive and cumbersome to construct.

11. The macro @devec will continue to be provided, but with a slightly different syntax:
#old syntax:  @devec r = sin(a) + cos(b)
#new syntax:
r = @devec sin(a) + cos(b)
This makes it more flexible, especially when defining functions. For example,
myfunc(a, b) = @devec sin(a) + cos(b)

12. @devec will support blocks and will try to fuse the loops when it makes sense to do so, as in:
u, v = @devec begin
    s = sin(a)
    c = cos(b)
    (s + c, s - c)
end

13. Additionally, an in-place version @devec! will be provided, as in:
@devec! u v begin
    s = sin(a)
    c = cos(b)
    u = s + c
    v = s - c
end

This will be developed in a new branch of NumericExtensions. When the branch is merged into master, NumericExtensions will be tagged as version 0.3.x.

Please let me know if you have suggestions and feedback. I really appreciate it and will try to incorporate helpful suggestions into this work.

Tim Holy

Oct 9, 2013, 1:16:02 PM
to juli...@googlegroups.com
I struggled with such issues in developing Cartesian, which has similar "efficient computing" goals and underwent a radical design change from version 0.0 to 0.1. At one point (in the 0.0 days) I started developing custom macros for small variants of the same overall tasks, and eventually it blew out of proportion. I junked that approach in favor of a small set of "tiny tools" that make it easy (albeit ugly) to do fairly arbitrary things. That said, I think you're aiming for something that is less developer-facing and more user-facing, so my experience may not be relevant.

And while I'm hesitant to recommend it, because I think of Cartesian as a "transitional" package (I really hope it goes away someday)... if you want your reductions/maps to work in arbitrary dimensions, and efficiently for SubArrays, you may want to consider basing them on Cartesian. It makes it easy to write multidimensional algorithms that naturally come out cache-efficient. We might want one new macro to make it easier to implement reductions.

--Tim



Viral Shah

Oct 9, 2013, 3:15:04 PM
to juli...@googlegroups.com
This sounds great. I would love to see one package for fast vectorized computation. It also seems that some of this machinery should eventually live in compiler passes.

-viral

Dahua

Oct 9, 2013, 3:22:58 PM
to juli...@googlegroups.com
Hi, Tim,

I really appreciate your opinion, and I understand that some computations (e.g. image filtering) cannot be expressed with such macros (or even a larger set of macros). Nonetheless, in practice, map-reduce-style computation constitutes a significant portion of real-world code. The primary goal of this effort is to provide tools so that users can easily express such computation in a concise way (without writing loops every time) and have it carried out as efficiently as possible.

Cartesian is a great package and provides much greater flexibility for writing generic loops. I will definitely look at using it to support computation over strided arrays of arbitrary dimensions. Since Cartesian's API sits at a relatively low level, it is still useful to provide tools on top of it, so that users can accomplish many of their tasks, which follow simple algorithmic patterns, through a simpler interface.

- Dahua  

Tim Holy

Oct 9, 2013, 4:11:42 PM
to juli...@googlegroups.com
On Wednesday, October 09, 2013 12:22:58 PM Dahua wrote:
> Nonetheless, in practice, map-reduce type of
> computation constitutes a significant portion of real world codes. The
> primary goal of this effort is to provide tools such that users can easily
> express such computation in a concise way (without writing loops every
> time) and the computation will be carried out in a way that is as efficient
> as possible.

Absolutely. I just meant that it might make _your_ life easier as you begin
this project, not that it's a substitute for the package you're trying to put
together. If you want to implement some of those maps and reductions, like

maxoverdims(A, (dim1, dim2, ...))

then using Cartesian (with one simple extra trick that I should just add) it
could be implemented like this:

szout = [size(A, d) for d = 1:$N]
szout[[region...]] = 1
B = nans(T, szout...)::Array{T,$N}
@nextract $N szout szout
@nloops $N i A d->(j_d = szout_d == 1 ? 1 : i_d) begin
    (@nref $N B j) = max((@nref $N B j), (@nref $N A i))
end

In 7 lines you get cache-efficiency (one traversal over A in the order that it's
stored in memory), excellent performance for both Arrays and SubArrays,
support for arbitrary dimensions, and support for arbitrary choices of how
many and which dimensions the user wants to reduce over. Ta-dah!

Cartesian wasn't in existence (or at least, not in good shape) when you wrote
NumericExtensions. But given that your NumericExtensions/reduce.jl is over 400
lines and needs special cases for reducing over two dimensions at once, I
thought maybe I could make your life easier the second time around :-). I just
need to implement support for that extra next-to-last argument of @nloops.

--Tim

Stefan Karpinski

Oct 9, 2013, 4:27:30 PM
to Julia Dev
I have to confess that seeing all these macros breaks my heart a little. Macros are great as the "ultimate escape valve", but if you need to use them this much, it indicates to me that there are some basic problems that the language isn't addressing well enough yet. Fortunately, macros are a great way to sketch and experiment with language features without having to change the language. I'm hoping that we can obviate the need for some of this stuff in the future. Some of it is just a matter of optimization – e.g. making the use of higher-order functions just as efficient as if you'd used a macro to generate custom code – while other things may require fundamental semantic extensions – e.g. loop fusion.

Dahua

Oct 9, 2013, 4:50:42 PM
to juli...@googlegroups.com
Hi, Stefan,

Julia, at this point, has great performance when you write low-level code (e.g. for-loops). This is of course great -- unlike in other languages, you don't have to resort to C to get performance! However, improvement is still needed for higher-level code (e.g. vectorized expressions or functional styles).

For example, ``map(x -> abs2(x), a)`` or ``map(abs2, a)`` is still much slower than ``abs2(a)``, which is a macro-generated vectorized function. That's why Julia base relies on @vectorize_1arg and similar macros to generate vectorized math functions. Part of the issue is that function arguments still cannot be properly inlined within a higher-order function. Therefore, for-loops are needed when performance is critical -- this motivated me to write several macros to help write those loops, so that I can write more concise code.
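To make the gap concrete, here is a minimal comparison one can run (a sketch; the exact timings vary by machine and Julia version):
a = rand(10^6)

function abs2_loop(a)
    r = similar(a)
    for i = 1:length(a)
        @inbounds r[i] = abs2(a[i])
    end
    r
end

@time map(abs2, a)   # pays a generic function call per element
@time abs2_loop(a)   # the hand-written loop is typically several times faster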

I expect that the core team will address this issue -- at that point, many of these macros will no longer be needed, which is actually what I am looking forward to.

- Dahua

Tim Holy

Oct 9, 2013, 4:56:12 PM
to juli...@googlegroups.com
On Wednesday, October 09, 2013 04:27:30 PM Stefan Karpinski wrote:
> I have to confess that seeing all these macros breaks my heart a little.
> Macros are great as the "ultimate escape valve", but if you need to use
> them this much, it indicates to me that there are some basic problems that
> the language isn't addressing well enough yet.

Absolutely. It's why I want Cartesian to go away, but in the meantime it is necessary. We're talking performance advantages of a hundred-fold or more on "mission-critical" code. Perhaps even more to the point, Cartesian is the first thing I've found that makes it fun (rather than a chore) to write algorithms (simple or complex) that support arbitrary dimensions.

But as the number of things that Cartesian can do has expanded, I've become a little more skeptical that we'll be there soon, because I'm guessing it won't be trivial to design nice syntax for all these things. I suppose that's made me less hesitant to use it. But I will be impressed, and _very_ pleased, when someone proves me wrong!

--Tim

Tim Holy

Oct 9, 2013, 8:46:14 PM
to juli...@googlegroups.com
One additional thought:

On Wednesday, October 09, 2013 04:27:30 PM Stefan Karpinski wrote:
> I have to confess that seeing all these macros breaks my heart a little.

Stefan, if it makes you feel any better, or worse :)... the reality is that, in base/, we already depend heavily on generated code for anything dealing with multiple dimensions. Think make_loop_nest, gen_cartesian_map, etc. These don't often see the light of day for users, but they underlie a surprisingly large fraction of what people do with Julia. Like Cartesian, they're also not very readable or user-friendly (they're not even documented). Cartesian at least takes a big step in what I think is the right direction, allowing you to write code that is reminiscent of the way you'd like to write it (similar to issue #1917, albeit with much uglier syntax), and is more modular. Hopefully when that issue is resolved we can largely do a drop-in replacement for code written with Cartesian.

--Tim

Viral Shah

Oct 9, 2013, 10:25:15 PM
to juli...@googlegroups.com

I really do believe that many of these concerns will get addressed as early as 0.3.

I agree with Tim that many such things are already in base, and some of Dahua's work will land there as well, once the basic machinery for better vectorized computation is implemented.

-viral

Stefan Karpinski

Oct 10, 2013, 5:14:34 PM
to Julia Dev
Right. What I basically mean is that we need a few more things -- the multidimensional iteration syntax from #1917, function argument inlining, and mechanisms for expressing fusable vectorized code -- to chip away at the things that currently require nasty metaprogramming. In the meantime, what you guys are doing and proposing is a good way forward.

Tim Holy

Oct 10, 2013, 5:40:32 PM
to juli...@googlegroups.com
We're all on the same page, then. Except I think this:

https://github.com/timholy/Cartesian.jl/issues/3#issuecomment-26055489

is much closer to what we need. The one in #1917 does not have enough
expressive power.

--Tim



Tom Short

Oct 11, 2013, 3:28:59 PM
to julia-dev
Dahua, one use case I'd like to keep from Devectorize is assignment into an existing array:

@devec a[:] = x + y .* sum(z)

For example, `a` could be a DataArray or an mmapped array.

Dahua

Oct 11, 2013, 3:34:12 PM
to juli...@googlegroups.com
Tom,

I am looking into this slightly modified syntax:
@devec!(a, x + y .* sum(z))

See points 12 and 13 in the plan above. I am also planning to allow in-place writing to multiple outputs with a fused loop.

- Dahua

Tom Short

Oct 11, 2013, 3:36:07 PM
to julia-dev
Great. Thanks, Dahua.

Dahua

Oct 14, 2013, 12:06:07 PM
to juli...@googlegroups.com
Implementation of this plan is underway (see https://github.com/lindahua/NumericExtensions.jl/pull/16).

In addition to the points proposed above, more optimization techniques will be implemented in the code generation, including constant propagation and scalar identification.

For example, the following expression
@devec (x + y) * (sin(2.0) + cos(2.0))
will emit code like the following:
r = # create an array of proper type and shape
tmp1 = sin(2.0) + cos(2.0)
for i = 1 : length(r)
    @inbounds r[i] = (x[i] + y[i]) * tmp1
end
Here, the macro will recognize ``sin(2.0) + cos(2.0)`` as a constant scalar, compute its value in advance, and store it in a temporary variable (named using gensym). It will also recognize that this temporary is a scalar, so ``(x + y) * tmp1`` can be expanded as an element-wise map. (Note: if we didn't know tmp1 is a scalar, ``(x + y) * tmp1`` could be a matrix multiplication.)

Another important aspect is that code generation for @map and devectorization of expressions will be coupled. For example,
@map((x, y)-> sin(f(x) + g(y)), a.^2 + b, b.^3)
will emit something similar to the following:
for i = 1 : length(r)
    @inbounds xi = a[i]^2 + b[i]
    @inbounds yi = b[i]^3
    @inbounds r[i] = sin(f(xi) + g(yi))
end

Also, after thinking about it carefully, I will retain functors.

Despite the greater flexibility of the @map and @devec macros, functors still have an important advantage -- they can be passed around as arguments to higher-order functions (@map and other macros can only recognize a pre-defined set of function names). These facilities are being utilized in other packages.

Therefore, functors won't be deprecated until function arguments can be specialized and inlined by the Julia compiler. Instead, the new version of the package will try to provide an easier and more user-friendly way to introduce new functors.
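For reference, defining a new functor currently looks roughly like this (a sketch from memory; treat names like BinaryFunctor, evaluate, and result_type as approximate):
using NumericExtensions

type AbsDiff <: BinaryFunctor end
NumericExtensions.evaluate(::AbsDiff, x, y) = abs(x - y)
NumericExtensions.result_type(::AbsDiff, Tx::Type, Ty::Type) = promote_type(Tx, Ty)

r = map(AbsDiff(), a, b)  # the functor can be passed to higher-order routines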

In terms of internal design, kernel composition and index generation will be decoupled into two modules, which will work together in the final code generation. With this design, a kernel can work with different kinds of indexing schemes -- for example, linear indexing for contiguous arrays, while non-contiguous strided arrays may use Tim's Cartesian facilities to generate indices.
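As a rough illustration of the decoupling, the same kernel can be driven by different index schemes (hand-written sketches of what the generated code might resemble):
# contiguous arrays: linear indexing
for i = 1:length(r)
    @inbounds r[i] = sin(a[i]) + cos(b[i])
end

# non-contiguous strided arrays: cartesian indexing (2-d case shown)
for j = 1:size(r, 2), i = 1:size(r, 1)
    @inbounds r[i, j] = sin(a[i, j]) + cos(b[i, j])
end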

- Dahua

Dahua

Oct 14, 2013, 12:19:26 PM
to juli...@googlegroups.com

On Monday, October 14, 2013 11:06:07 AM UTC-5, Dahua wrote:
Implementation of this plan is underway (see https://github.com/lindahua/NumericExtensions.jl/pull/16)

The PR has been renamed, and its new address is https://github.com/lindahua/NumericExtensions.jl/pull/18.