NumericExtensions has been significantly updated recently.
The major changes:
- Reinstated/introduced reduce, mapreduce, reducedim, and mapreducedim methods using functors (a sketch of the functor pattern is shown after this list)
- All map, reducedim, and mapreducedim functions, and the functions derived from them, now support both contiguous arrays and strided views (those from ArrayViews.jl) with very efficient implementations
- The PDMat part has been separated into a new package, PDMats.jl
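The functor-based methods follow the usual functor pattern: a concrete type plus an evaluation method that the compiler can specialize on, which anonymous functions currently do not allow. Below is a minimal sketch of that pattern; the names (AbsFun, evaluate, my_mapreduce) are only for illustration, not the package's actual API, and it is written in the same Julia-0.3-era syntax as the rest of this post.

immutable AbsFun end
evaluate(::AbsFun, x) = abs(x)    # the "call" of the functor

function my_mapreduce(f, op, a::AbstractArray)
    # apply the functor f to each element and fold the results with op
    s = evaluate(f, a[1])
    for i = 2:length(a)
        s = op(s, evaluate(f, a[i]))
    end
    return s
end

my_mapreduce(AbsFun(), +, [-1.0, 2.0, -3.0])   # => 6.0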
The internal implementation strategy is recursion (computations over N-D arrays are decomposed into computations over (N-1)-D slices) plus highly optimized 1D/2D kernels.
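For example, reducing a matrix over dim 1 bottoms out in a per-column kernel, roughly in the following spirit (my own illustration of the idea, not the actual package code; sum_kernel stands for a hand-optimized kernel):

using ArrayViews   # provides view; later Julia versions have view in Base

sum_kernel(x) = sum(x)   # stand-in for a hand-optimized (SIMD/BLAS) 1D kernel

function sum_cols(a::Matrix{Float64})
    r = zeros(size(a, 2))
    for j = 1:size(a, 2)
        r[j] = sum_kernel(view(a, :, j))   # each column is a contiguous 1D slice
    end
    return r
end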
Here are some benchmarks obtained by running computations over 1000x1000 matrices:
a_sub = sub(a, 1:999, :)
a_view = view(a, 1:999, :) # using ArrayViews
for sum:
dim = 1: sum(a_sub, dim) => 0.1314s sum(a_view, dim) => 0.0374s | gain = 3.5168x
dim = 2: sum(a_sub, dim) => 0.1740s sum(a_view, dim) => 0.0574s | gain = 3.0286x
for sumabs:
dim = 1: sum(abs(a_sub), dim) => 0.6341s sumabs(a_view, dim) => 0.0286s | gain = 22.1343x
dim = 2: sum(abs(a_sub), dim) => 0.6331s sumabs(a_view, dim) => 0.0639s | gain = 9.9013x
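Timings of this kind can be collected with a simple harness along the following lines. This is only a sketch, not the exact script behind the numbers above, and it assumes a random Float64 matrix, which the measurements do not depend on:

using ArrayViews
using NumericExtensions

a = rand(1000, 1000)      # assumption: element type/content not specified above
a_sub  = sub(a, 1:999, :)
a_view = view(a, 1:999, :)

sum(a_sub, 1); sum(a_view, 1)   # warm up, so compilation is not timed

for dim = 1:2
    t_sub  = @elapsed for k = 1:20; sum(a_sub, dim); end
    t_view = @elapsed for k = 1:20; sum(a_view, dim); end
    println("dim = ", dim, ": gain = ", t_sub / t_view, "x")
end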
Note: the ArrayViews package now provides an ``ellipview`` function, as in
ellipview(a, i) # equivalent to view(a, ..., i), e.g. view(a, :,:,i) when ndims(a) == 3
Some thoughts:
Tim's Cartesian machinery is very nice and is now part of Julia Base. It can express generic algorithms in a small number of lines of code while delivering reasonable run-time performance. However, for really high-performance implementations, recursion + highly optimized kernels is still faster at run time, and this strategy allows one to plug in SIMD, BLAS, or other external functions for optimal performance (through multiple dispatch).
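To make the comparison concrete, the Cartesian style looks roughly like this for a fixed N = 3 (a textbook-style example, not code from NumericExtensions):

using Base.Cartesian

# @nloops/@nref expand into the three nested loops and the indexing
# expression at compile time, so the source stays short (here for a 3-D array a).
function sum3d(a::AbstractArray)
    s = zero(eltype(a))
    @nloops 3 i a begin
        s += @nref 3 a i     # a[i_1, i_2, i_3]
    end
    return s
end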
Consider the following code:
sumabs(view(a, 1:500, 1:2:5, :), (1, 2))
The strategy implemented in NumericExtensions will invoke ``BLAS.asum`` when performing the computation along each column. That's why you see the over-20x speed gain above.
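The column-wise use of BLAS is in the following spirit (my own sketch of the idea, not the package's internals):

import Base.LinAlg.BLAS   # the BLAS wrappers live in Base.LinAlg in this Julia generation

function sumabs_cols(a::Matrix{Float64})
    m, n = size(a)
    r = zeros(n)
    for j = 1:n
        # column j is the contiguous block a[(j-1)*m + 1 : j*m], so its
        # absolute-value sum can be handed directly to the BLAS asum kernel
        r[j] = BLAS.asum(m, pointer(a, (j-1)*m + 1), 1)
    end
    return r
end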
- Dahua