Fast vector element-wise multiplication


Lionel du Peloux

Oct 6, 2015, 10:28:29 AM
to julia-users
Dear all,

I'm looking for the fastest way to do element-wise vector multiplication in Julia. The best I've managed is the implementation below, which still runs 1.5x slower than the dot product. I assume the dot product must perform such an element-wise multiplication internally ... and then sum the products.

The MKL library includes such an operation (v?Mul) but it seems OpenBLAS does not. So my questions are:

1) is there any chance I can do vector element-wise multiplication faster than the actual dot product?
2) why is the built-in element-wise multiplication operator (.*) so much slower than my own implementation for such a basic linear-algebra operation (in pure Julia)?

Thank you,
Lionel

Best custom implementation:

function xpy!{T<:Number}(A::Vector{T}, B::Vector{T})
    n = size(A)[1]
    if n == size(B)[1]
        for i = 1:n
            @inbounds A[i] *= B[i]
        end
    end
    return A
end

Benchmark results (JuliaBox, A = randn(300000)):

function                          CPU (s)     GC (%)   ALLOCATION (bytes)   RELATIVE (x)
dot(A,B)                          1.58e-04    0.00     16                   1.0
xpy!(A,B)                         2.31e-04    0.00     80                   1.5
NumericExtensions.multiply!(P,Q)  3.60e-04    0.00     80                   2.3
xpy!(A,B) - no @inbounds check    4.36e-04    0.00     80                   2.8
P.*Q                              2.52e-03    50.36    2400512              16.0
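
For reference, a minimal sketch of how such timings can be collected (illustrative only, not the original benchmark script):

A = randn(300000); B = randn(300000)

xpy!(A, B); dot(A, B)       # warm up so compilation cost is excluded

@time xpy!(A, B)            # prints elapsed time and bytes allocated
@time dot(A, B)
@time A .* B                # allocates a fresh result vector, hence the GC pressure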

Tomas Lycken

Oct 6, 2015, 11:04:31 AM
to julia-users
I made some simple changes to your `xpy!`, and managed to get it to allocate nothing at all while performing very close to the speed of `dot`. I don't know anything about e.g. `@simd` instructions, but I imagine they could help speed this up even further.

The most significant change was switching `size(A)[1]` to `size(A,1)` (and similarly for `B`) - the former has to construct and index into a tuple, while the latter won't have to do that. `length(A)` would have worked too.
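
In code, that change amounts to something like this (a sketch based on the description above; the rest of the function is unchanged):

function xpy!{T<:Number}(A::Vector{T}, B::Vector{T})
    n = size(A, 1)            # avoids building and indexing a tuple
    if n == size(B, 1)
        for i = 1:n
            @inbounds A[i] *= B[i]
        end
    end
    return A
end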


// T

Patrick Kofod Mogensen

Oct 6, 2015, 11:33:22 AM
to julia-users
Well, I guess your table pretty much shows it, right? It seems that `.*` allocates a lot of temporary memory to carry out the calculation.

Christoph Ortner

Oct 6, 2015, 12:29:04 PM
to julia-users
a *= b is equivalent to a = a * b, which allocates a temporary variable I think?

Try 

@fastmath @inbounds @simd for i=1:n
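
Applied to the function above, that would look something like this (a sketch; note that @fastmath can relax strict floating-point semantics):

function xpy!{T<:Number}(A::Vector{T}, B::Vector{T})
    n = size(A, 1)
    if n == size(B, 1)
        @fastmath @inbounds @simd for i = 1:n
            A[i] *= B[i]
        end
    end
    return A
end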

Christoph Ortner

Oct 6, 2015, 12:29:40 PM
to julia-users
or, possibly A[i] = A[i] * B[i]

(I'm not sure whether @simd automatically translates *= to what it needs)

Steven G. Johnson

Oct 6, 2015, 1:52:18 PM
to julia-users


On Tuesday, October 6, 2015 at 12:29:04 PM UTC-4, Christoph Ortner wrote:
a *= b is equivalent to a = a * b, which allocates a temporary variable I think?

A * A only allocates memory on the heap if A is an array or some other heap-allocated datatype. For A[i] *= B[i] where A[i] and B[i] are small scalar types like Float64, no temporary is allocated; the compiler just puts the result in a register.
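
One quick way to convince yourself (an illustrative check; scale1! is just a throwaway helper, not an existing function):

function scale1!(A, B)
    @inbounds A[1] *= B[1]
    return nothing
end

A = randn(10); B = randn(10)
scale1!(A, B)                # warm up; the first call compiles
@allocated scale1!(A, B)     # returns 0: the scalar product stays in a register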

Patrick Kofod Mogensen

Oct 6, 2015, 2:23:33 PM
to julia-users
That was supposed to be "A * B only allocates..." right?

Steven G. Johnson

Oct 6, 2015, 3:40:21 PM
to julia-users


On Tuesday, October 6, 2015 at 2:23:33 PM UTC-4, Patrick Kofod Mogensen wrote:
That was supposed to be "A * B only allocates..." right?

Yes.

Steven G. Johnson

Oct 6, 2015, 3:46:24 PM
to julia-users
Note that the BLAS dot product probably uses all sorts of tricks to squeeze the last cycle of SIMD performance out of the CPU; e.g. the OpenBLAS ddot kernel for Sandy Bridge is hand-coded in assembly.

Getting the last 30% or so of this performance can be extremely tricky.

Lionel du Peloux

Oct 6, 2015, 4:31:36 PM
to julia-users

Thank you for all of your suggestions.
The @simd macro does give a (very) slight performance improvement (5%).

chobb...@gmail.com

Jun 20, 2016, 6:38:15 AM
to julia-users
I have the same question regarding how to calculate the entry-wise vector product and found this thread. As a novice, I wonder if the following code snippet is still the standard approach to entry-wise vector multiplication that one should stick to in practice? Thanks!


@fastmath @inbounds @simd for i = 1:n
    A[i] *= B[i]
end

Chris Rackauckas

Jun 20, 2016, 8:57:25 AM
to julia-users
I think that for medium-sized (but not large) arrays on v0.5 you may want to use @threads from the threading branch, and then for really large arrays you may want to use @parallel. But you'd have to test some timings.
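
A rough sketch of the threaded variant (assuming the experimental Base.Threads interface in v0.5; xpy_threads! is just an illustrative name):

using Base.Threads

function xpy_threads!(A, B)
    @threads for i = 1:length(A)   # iterations are split across threads
        @inbounds A[i] *= B[i]
    end
    return A
end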

chobb...@gmail.com

Jun 20, 2016, 9:50:22 AM
to julia-users
Thanks! I'm still using v0.4.5. In this case, is the code I highlighted above still the best choice for doing the job?

Chris Rackauckas

Jun 20, 2016, 10:05:31 AM
to julia-users
Most likely. I would also time it with and without @simd at your problem size. For some reason I've had some simple loops do better without @simd. 

chobb...@gmail.com

Jun 20, 2016, 10:23:50 AM
to julia-users
Thanks for the confirmation! Yes, I need more tests to see what the best practice is for my particular problem.

Sheehan Olver

Nov 1, 2016, 7:39:15 PM
to julia-users
Should this be added to a package?  I imagine if the arrays are on the GPU (AFArrays) then the operation could be much faster, and having a consistent name would be helpful.

Chris Rackauckas

Nov 1, 2016, 9:55:51 PM
to julia-users
This is pretty much made obsolete by the . fusing changes:

A .= A.*B

should be an in-place element-wise update of A by B (Tomas' solution).

Sheehan Olver

Nov 1, 2016, 10:06:39 PM
to julia...@googlegroups.com
Ah, good point.  Though I guess that won't work til 0.6 since .* won't auto-fuse yet? 


Tom Breloff

Nov 1, 2016, 10:27:40 PM
to julia-users
As I understand it, the .* will fuse, but the .= will not (until 0.6?), so A will be rebound to a newly allocated array.  If my understanding is wrong I'd love to know.  There have been many times in the last few days that I would have used it...

Chris Rackauckas

Nov 1, 2016, 10:51:54 PM
to julia-users
It's the other way around: .* won't fuse because it's still an operator; .= will. If you want .* to fuse, you can instead do:

A .= *.(A,B)

since this invokes the broadcast on *, instead of invoking .*. But that's just a temporary thing.

Sheehan Olver

Nov 1, 2016, 11:06:12 PM
to julia...@googlegroups.com
Ah thanks!

Though I guess if I want the same code to work also on a GPU array then this won't help?


Chris Rackauckas

Nov 2, 2016, 3:35:54 AM
to julia-users
Yes, this most likely won't help for GPU arrays because you likely don't want to be looping through elements serially: you want to call a vectorized GPU function which will do the computation in parallel on the GPU. ArrayFire's mathematical operations are already overloaded to do this, but I don't think they can fuse.

Sheehan Olver

Nov 2, 2016, 3:41:43 AM
to julia...@googlegroups.com

OK, good to know. I think putting the function in a package is overkill.

Tim Holy

Nov 2, 2016, 4:17:13 AM
to julia...@googlegroups.com
Hmm, that's surprising. Looks like we're using generic broadcasting machinery for that operation (check out what @which P.*P returns). Might be good to add .* to this line: https://github.com/JuliaLang/julia/blob/b7f1aa7554c71d3759702b9c2e14904ebdc94199/base/arraymath.jl#L69. Want to make a pull request?

Best,
--Tim

Patrick Kofod Mogensen

Nov 2, 2016, 5:52:14 AM
to julia-users
Does that work for you? I have to write

A .= (*).(A,B)
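
For reference, a minimal end-to-end sketch of the fused in-place form discussed above (0.5-era syntax; on 0.6 and later, A .= A .* B fuses on its own):

A = randn(300000); B = randn(300000)
A .= (*).(A, B)    # broadcasts * and writes the result into A in place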