Micro-benchmarking Vandermonde matrix generation

Joshua Adelman

Jan 3, 2015, 11:27:00 PM
to julia...@googlegroups.com
I'm a long time Python/Numpy user and am starting to play around with Julia a bit. To get a handle on the language and how to write fast code, I've been implementing some simple functions and then trying to performance tune them. One such experiment involved generating Vandermonde matrices and then comparing timings vs the method that numpy supplies (np.vander). The code for two simple implementations that I wrote, plus the method supplied by MatrixDepot.jl are in this notebook along with timings for each and the corresponding timing for the numpy method on the same machine. 
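Roughly, all of the implementations being compared follow the same pattern: build the matrix one column at a time, carrying a running power of x. Sketched here in numpy for concreteness (my Julia versions are in the notebook; note this sketch uses increasing column order, while np.vander defaults to decreasing):

```python
import numpy as np

def vander_sketch(x):
    """Build a Vandermonde matrix column by column: v[:, k] == x**k."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    v = np.empty((n, n))
    p = np.ones(n)          # running power of x, starting at x**0
    for k in range(n):
        v[:, k] = p         # assign the whole column at once
        p = p * x           # advance to the next power
    return v

x = np.array([2.0, 3.0, 5.0])
print(np.allclose(vander_sketch(x), np.vander(x, increasing=True)))  # True
```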


Generally Julia fares pretty well against numpy here, but does not necessarily match or beat it over all array sizes. The methods I wrote are similar to the numpy implementation and are typically faster than what's in MatrixDepot.jl, but I was hoping someone with a bit more experience in Julia might have some further tricks that would be educational to see. I've already looked at the Performance Tips section of the documentation, and I think I'm following best practices. 

Suggestions and feedback are appreciated as always.

Josh

PS - maybe it's my experience with numpy, but I wonder if any thought has been given to following more of a numpy-style convention of always returning a view of an array when possible, rather than a copy? This is often a source of confusion for new numpy users, but the rule is fairly intuitive once you realize that, in general, you get a view whenever the slice or transformation can be represented by a fixed stride through the data.
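For example, the stride rule in action:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

s = a[::2, 1:3]                # expressible as a fixed stride -> a view
s[0, 0] = 99
print(a[0, 1])                 # 99: writing through the view mutated a
print(np.shares_memory(a, s))  # True

f = a[[0, 2], :]               # fancy indexing has no fixed stride -> a copy
f[0, 0] = -1
print(a[0, 0])                 # still 0: the copy is independent
print(np.shares_memory(a, f))  # False
```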

Jiahao Chen

Jan 3, 2015, 11:56:20 PM
to julia...@googlegroups.com

On Sat, Jan 3, 2015 at 11:27 PM, Joshua Adelman <joshua....@gmail.com> wrote:
PS - maybe it's my experience with numpy, but I wonder if any thought has been given to following more of a numpy-style convention of always returning a view of an array when possible rather than a copy?

Search the Github issues for ArrayViews.

All 3 versions of your Julia code have assignments of the form

v2[:, k] = p

which could be faster if you explicitly devectorize.
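To spell out what I mean by devectorizing, here is the pattern in numpy terms (illustration only: in numpy the explicit loop is slow, but in Julia the same rewrite avoids allocating a temporary on every iteration and the loop runs at native speed):

```python
import numpy as np

def vander_devec(x):
    """Devectorized: replace the column assignment v2[:, k] = p
    with an explicit scalar loop over rows."""
    n = len(x)
    v = np.empty((n, n))
    for i in range(n):
        p = 1.0
        for k in range(n):
            v[i, k] = p     # one scalar store per element
            p *= x[i]       # scalar multiply, no temporary array
    return v

x = np.array([2.0, 3.0, 5.0])
print(np.allclose(vander_devec(x), np.vander(x, increasing=True)))  # True
```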

Thanks,

Jiahao Chen
Staff Research Scientist
MIT Computer Science and Artificial Intelligence Laboratory

Joshua Adelman

Jan 4, 2015, 12:09:24 AM
to julia...@googlegroups.com
Hi Jiahao,

Actually, when I devectorized vander2 as suggested (making the explicit loop over the first dimension the inner-most loop), the cases where N < 100 are indeed faster, but for N > 100 the devectorized code is slower.

Josh 

Jiahao Chen

Jan 4, 2015, 12:54:42 AM
to julia...@googlegroups.com
That's an interesting observation.

Vandermonde matrices are actually quite an interesting test of unusual floating point computations. Constructing Vandermonde matrices explicitly very quickly produces matrix elements that overflow to +/-Inf or underflow toward 0, unless your starting vector has entries that are all exactly of magnitude 1. Hence the larger problems you are computing on [1:N] end up testing largely the speed of multiplying finite numbers by infinity. Conversely, had your tests used rand(N) as input (uniform random numbers between 0 and 1), you would have tested largely the speed of multiplying denormalized floats and zeros.
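The effect is easy to verify in numpy itself; a quick check with np.vander on smaller inputs than the timings below (the particular sizes 500 and 1200 are just chosen to trigger each regime):

```python
import numpy as np

# Overflow: on x = 1:N the high powers blow past the double-precision
# maximum (~1.8e308), so the matrix fills with +Inf. N = 500 already
# does it, since e.g. 500**499 is far out of range.
with np.errstate(over="ignore"):
    v_big = np.vander(np.arange(1.0, 501.0))
print(np.isinf(v_big).any())      # True

# Underflow: on x in (0, 1) the high powers drop below the smallest
# normal double (~2.2e-308) into the subnormal range, where many CPUs
# take a slow microcoded path. 0.5**k is subnormal once k > 1022.
with np.errstate(under="ignore"):
    v_small = np.vander(np.full(1200, 0.5))
tiny = np.finfo(float).tiny       # smallest positive normal double
print(((v_small > 0) & (v_small < tiny)).any())   # True
```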

On my machine, the speed penalty from denormalized floats turns out to be significant, but not the +Infs.

julia> gc()

julia> x=[(-1.0)^n for n=1:10^4]; @time vander2(x); #Lots of multiplications involving +1.0 and -1.0
elapsed time: 1.216966083 seconds (1600000400 bytes allocated, 6.44% gc time)

julia> gc()

julia> x=[1.0:10000.0]; @time vander2(x); #Lots of multiplications involving +Inf
elapsed time: 1.206602925 seconds (1600000400 bytes allocated, 6.04% gc time)

julia> gc()

julia> x=rand(10000); @time vander2(x); #Lots of multiplications involving 0s and denormalized floats
elapsed time: 2.952932852 seconds (1600000400 bytes allocated, 2.82% gc time)

On my machine, the equivalent computations using numpy are significantly slower:

>>> x=[(-1.0)**n for n in range(1,10001)]; start=time.clock(); np.vander(x); time.clock()-start
...
3.961352000000005
>>> x=[n for n in range(1,10001)]; start=time.clock(); np.vander(x); time.clock()-start
...
5.036712999999999
>>> x=np.random.rand(10000); start=time.clock(); np.vander(x); time.clock()-start
...
5.462721000000002
