I've written my first actual Julia code. I deliberately started on a "small"
project (and I might do a few more of those) before getting more ambitious. I
tried to pick something that doesn't yet seem to be part of the library or
example code (I could be wrong), and I decided to implement a simple priority
queue as a heap. I've attached the code; I'd be very grateful for critical
feedback, since that's the only way I'll learn.
I have several questions:
1. If this is useful, once it gets polished should it go in examples/?
2. If so, should I keep mailing patches to the list?
3. How does one handle documentation? You'll notice I put some usage examples
at the top of the file, in an attempt to be a bit helpful.
Long-term, are there plans for any kind of "help system" within Julia? Matlab
displays the initial block of comments inside the function definition when you
type "help myfunc"; it also has a pretty-printed documentation system which is
entirely divorced from those comments in the code, which can be confusing if
the two contradict each other. It might be worth keeping this issue in mind
when developing policy for a future in which there might be a graphical IDE,
etc.
Best,
--Tim
# These functions define a min-heap
# Example usage of the Heap object:
# You can create an empty min-heap this way:
#   h = Heap(Float64)
# This defaults to using isless for comparison.
# A more complex and complete example, yielding a max-heap:
#   isgreater(x,y) = x > y
#   h = Heap(isgreater,rand(3))
#   for i = 1:10
#     push!(h,rand())
#   end
#   length(h)
#   max_val = pop!(h)
#
# Example using pure vectors (avoids making any copies):
#   z = rand(8)
#   vector2heap!(z)
#   isheap(z)
#   min_val = heap_pop!(z)
#   heap2rsorted!(z)

## Function definitions for operating on vectors ##

function _heapify!{T}(lessthan::Function,x::Vector{T},i::Int,len::Int)
    # Among node i and its two children, find the smallest
    il = 2*i  # index of left child
    if il > len
        return
    end
    ismallest = lessthan(x[il],x[i]) ? il : i  # index of smallest so far
    if il < len
        ismallest = lessthan(x[il+1],x[ismallest]) ? il+1 : ismallest
    end
    if ismallest != i
        # Put the smallest at i
        tmp = x[ismallest]
        x[ismallest] = x[i]
        x[i] = tmp
        # Recursively re-heapify
        _heapify!(lessthan,x,ismallest,len)
    end
end

function vector2heap!{T}(lessthan::Function, x::Vector{T})
    for i = convert(Int,ifloor(length(x)/2)):-1:1
        _heapify!(lessthan,x,i,length(x))
    end
end
function vector2heap!{T}(x::Vector{T})
    vector2heap!(isless,x)
end

function isheap{T}(lessthan::Function, x::Vector{T})
    for i = 1:convert(Int,ifloor(length(x)/2))
        i2 = 2*i
        if !lessthan(x[i],x[i2])
            return false
        end
        if i2 < length(x) && !lessthan(x[i],x[i2+1])
            return false
        end
    end
    return true
end
function isheap{T}(x::Vector{T})
    isheap(isless,x)
end

function heap_push!{T}(lessthan::Function, x::Vector{T},item::T)
    # append new element at the bottom
    push(x,item)
    # let the new item percolate up until it has a smaller parent
    i = length(x)
    ip = convert(Int,ifloor(i/2))  # index of parent
    while i > 1 && lessthan(x[i],x[ip])
        # swap i and its parent
        tmp = x[ip]
        x[ip] = x[i]
        x[i] = tmp
        # traverse up the tree
        i = ip
        ip = convert(Int,ifloor(i/2))  # index of parent
    end
end
function heap_push!{T}(x::Vector{T},item::T)
    heap_push!(isless,x,item)
end

function heap_pop!{T}(lessthan::Function, x::Vector{T})
    min_item = x[1]
    x[1] = x[end]
    pop(x)
    _heapify!(lessthan,x,1,length(x))
    return min_item
end
function heap_pop!{T}(x::Vector{T})
    min_item = heap_pop!(isless,x)
    return min_item
end

function heap2rsorted!{T}(lessthan::Function, x::Vector{T})
    for i = length(x):-1:2
        tmp = x[1]
        x[1] = x[i]
        x[i] = tmp
        _heapify!(lessthan,x,1,i-1)
    end
end
function heap2rsorted!{T}(x::Vector{T})
    heap2rsorted!(isless,x)
end

## Heap object ##

type Heap{T}
    lessthan::Function
    data::Array{T,1}

    function Heap(lessthan::Function,x::Array{T,1})
        data = copy(x)
        vector2heap!(lessthan,data)
        new(lessthan,data)
    end
end
Heap{T}(lessthan::Function,x::Vector{T}) = Heap{T}(lessthan,x)
Heap{T}(x::Vector{T}) = Heap{T}(isless,x)
Heap{T}(lessthan::Function,::Type{T}) = Heap{T}(lessthan,zeros(T,0))
Heap{T}(::Type{T}) = Heap{T}(isless,zeros(T,0))

function push!{T}(h::Heap{T},item::T)
    heap_push!(h.lessthan,h.data,item)
end
function pop!{T}(h::Heap{T})
    min_item = heap_pop!(h.lessthan,h.data)
    return min_item
end
function length{T}(h::Heap{T})
    return length(h.data)
end
On Tuesday, March 13, 2012 08:41:58 pm Mateusz Kubica wrote:
> Hi,
> I'm just wondering if it wouldn't be clearer to replace
> # isgreater(x,y) = x > y
> # h = Heap(isgreater,rand(3))
> by simply
> # h = Heap((>),rand(3))
> Haskell uses such a notation and I find it very compact and readable.
Changed as suggested (without the parens, as Stefan suggested). I just sent a
pull request; I'm quite new to git, so if you notice problems I'll try to fix.
I did a little timing:
julia> n = 100000
100000
julia> x = rand(n)
[0.310245, 0.0673555, 0.705961, 0.871099 ... 0.691175, 0.608054, 0.0276982,
0.563395]
julia> tic(); vector2heap!(x); toc()
elapsed time: 0.03545689582824707 seconds
0.03545689582824707
julia> tic(); for i = 1:n; heap_pop!(x); end; toc()
elapsed time: 0.37431907653808594 seconds
0.37431907653808594
There's a Matlab implementation
(http://www.mathworks.com/matlabcentral/fileexchange/34218-heap) whose
implementation details appear to be very close to what I did for the Julia
version. Timing:
>> n = 100000;
>> tic; h = Heap(rand(1,n)); toc
Elapsed time is 15.433420 seconds.
>> tic; for i = 1:n; mx = heapExtractMax(h); end; toc
Elapsed time is 262.971887 seconds.
So the Julia version absolutely smokes the Matlab version.
However, my Julia code is about 10- to 20-fold slower than a C++ STL-based
implementation, which is attached and can be compiled (on a Linux system)
with:
g++ -Wl,--rpath,/usr/local/lib/ -I/home/tim/src/boost_1_48_0 -L/usr/local/lib/
-lboost_chrono pq_test.cpp -o pq_test
(Obviously you'd have to change some of the paths here)
Here are the timing results, first without optimization:
$ ./pq_test
make_heap took 0.00302657 seconds
pop_heap took 0.0290537 seconds
And with -O2 optimization:
$ ./pq_test
make_heap took 0.00188609 seconds
pop_heap took 0.0165661 seconds
I should say that I didn't copy the C++ STL code in writing my Julia code (I
usually find the standard library hard to read and understand), and obviously
the STL is a highly-optimized library that may use a number of tricks that
could in principle be borrowed to get better performance. But, any obvious
things I should fix? I presume there is no Julia profiler yet?
Best,
--Tim
This is a long email; the gist is that Julia's performance seems to be
fantastic when you can avoid recursion/lots of function calls, and may suffer
compared to C++ when you cannot. More details below.
On Tuesday, March 13, 2012 11:33:16 pm Stefan Karpinski wrote:
> Nice that we're beating Matlab handily.
Indeed! It's huge.
> STL is still the gold standard for
> these kinds of things. I guess we have something to strive for ;-)
I did more testing, and it turns out that while STL is unquestionably good,
merely translating my Julia code directly into C++ gets to within a factor of
2 of STL's performance for the slower "pop" operation, and nearly the same as
STL for the faster "make_heap" operation. Therefore, when compiled with -O2,
my Julia code is ~10x slower than the same code translated to the C++ version.
Since this is a worse factor than for any of the tasks shown on
http://julialang.org/, I thought it would be worth looking into in more
detail.
First, the numbers:
Without optimization:
$ ./pq_test
std::make_heap took 0.00586058 seconds
std::pop_heap took 0.0380975 seconds
Extreme values: 52150 and 4.29488e+09
vector2heap took 0.00610891 seconds
pop took 0.0850708 seconds
min_val = 52150, max_val = 4.29488e+09
(From the fact that the extreme values are identical, one can infer that my
version of the algorithm, the "vector2heap" and "pop" lines, seems to be
working properly.)
With -O2:
$ ./pq_test
std::make_heap took 0.00268634 seconds
std::pop_heap took 0.0179661 seconds
Extreme values: 52150 and 4.29488e+09
vector2heap took 0.00304951 seconds
pop took 0.037797 seconds
min_val = 52150, max_val = 4.29488e+09
For reference, here are the Julia #s again (just copied from a previous
email):
julia> tic(); vector2heap!(x); toc()
elapsed time: 0.03545689582824707 seconds
0.03545689582824707
julia> tic(); for i = 1:n; heap_pop!(x); end; toc()
elapsed time: 0.37431907653808594 seconds
0.37431907653808594
Of course they fluctuate from one run to the next, but this is very
characteristic of the overall pattern, with the Julia version often taking a
bit more than 10-fold more time.
I got curious to see if I could figure out what was going on. One of the things
about heap algorithms is that they are typically heavily recursive, and so I
wondered whether Julia perhaps has a higher function-call overhead than the C++
code. So I wrote the following:
function sum_numbers{T}(x::Vector{T})
    s::T = 0
    for i = 1:length(x)
        s += x[i]
    end
    return s
end
This ran approximately twice as fast as its C++ counterpart (!):
julia> const c = rand(1000000);
julia> s = 0; tic(); for j = 1:100; s += sum_numbers(c); end; toc(); s
elapsed time: 0.18297600746154785 seconds
4.997657313178386e7
C++ version:
$ ./sum_numbers
summation took 0.364052 seconds, with result 2.1476e+09
(That was even with -O2)
I then tested this function:
function recursive(n::Int)
    if n > 0
        recursive(n-1)
    end
end
julia> tic(); for i = 1:1000; recursive(10000); end; toc()
elapsed time: 0.2863640785217285 seconds
For C++, with -O2 the recursive calls get optimized out, so I just tested the
non-optimized version:
$ ./recursive
Recursion took 0.0602431 seconds
which is about 5-fold faster. If this could be optimized, I wouldn't be
surprised if the factor rose to 10-fold.
Hence, I suspect that Julia's poorer performance on heaps is largely due to the
recursive nature of the algorithm I implemented; perhaps Julia has a
substantially larger function-call overhead than C++. I may try rewriting my
heap algorithm without the recursion and see what happens.
Hope this helps in some way.
Best,
--Tim
Compiling them can be a bit annoying depending on whether you have boost
installed in a standard location or not, if not then you'll need something
along the lines of this:
g++ -Wl,--rpath,/usr/local/lib/ -I/home/tim/src/boost_1_48_0 -L/usr/local/lib/
-lboost_chrono recursive.cpp -o recursive
Or you could change the timers over to being non-boost.
Best,
--Tim
The fib benchmark mostly covers function calls, and there we are 2x
slower than C. Yes, we have a bit of call overhead relative to C.
There are optimizations left to do there. I wouldn't go so far as to
say people should avoid recursive code in julia; for example the
quicksort benchmark is recursive and within a factor of 2 of C, which
to me is pretty good.
I removed the recursion, and this saved about 20% of the time. Nice and
worthwhile, but nothing very dramatic; it was still ~16-fold slower than using
the STL.
Then, I went ahead and wrote a separate version of each function for when the
comparison function is "<". That is to say, the former code looked like this:
function dosomething(cmp::Function,argA,...)
    run some tasks
end
function dosomething(argA,...)
    dosomething(<,argA,...)
end
and the new code like this:
function dosomething(cmp::Function,argA,...)
    run some tasks
end
function dosomething(argA,...)
    a copy of the code above with cmp(a,b) everywhere replaced with (a < b)
end
Using the version without the cmp resulted in more than a five-fold speedup.
We're now within a factor of 2.5 of the STL, even when compiled with -O2. I can
live with a factor of 3; I'll confess that a factor of 16 left me with a few
doubts about whether to go to the effort to switch to Julia, or just stick to
writing MEX files all the time... And I haven't even yet studied the STL
implementation to see if I can extract some additional performance tips.
Before I commit this, let me note that the STL does not pay nearly the same
high price for a generic comparison function. Is it by any chance due to the
compiler having trouble guessing the return type of the comparison function?
Is there a way to declare that return type? Is there any equivalent of
"inline" (if this doesn't already happen automatically)?
If most/all of the above are answered with "no," then I wonder whether a
better design would be to have two implementations, one of a min-heap and
another of a max-heap, and then get rid of the idea of a generic comparison
function altogether. Presumably the generic case could be handled by
overloading the < operator using the type system anyway.
Best,
--Tim
Not by much! I declare an effective tie.
Best,
--Tim
On Thursday, March 15, 2012 05:31:02 am Jonas Rauch wrote:
> I guess in this case every
> call to "<" can be inlined, while otherwise it cannot. So what would help
> here is to have the cmp function as a template parameter rather than a
> function argument (as it is in Boost?). This way the compiler can inline
> almost all sensible comparison functions, even for composite types, for
> example stuff like cmp(x, y) = x.a < y.a.
>
> So, to the developers: How hard would it be to have a function as template
> parameter?
I've been thinking about this a bit more. I don't know much about compilers,
but I'd guess that a sufficiently-smart compiler could infer the type. The
compiler has to make a version for
heapify!(cmp::Function,Vector{double},int,int)
and in principle it seems that it should be able to ascertain that cmp is used
as cmp(double,double) and inline it. Perhaps a "hint", of the form
heapify!(cmp::Function(::T,::T)::bool,Vector{T},int,int)
would make its job easier.
As an alternative to function template parameters, it seems like it might be
better to do away with the version with a generic cmp input and simply require
that type T has a < operator. Not sure that allows inlining in general, but it
certainly seems to work for the case of Float64, etc.
I suspect that these issues could all be "temporary" consequences of Julia
being a young (but already very impressive!) project.
Best,
--Tim
On Thursday, March 15, 2012 08:29:12 am Jonas Rauch wrote:
> Now, if cmp is not known at compile time, this kind of optimization cannot
> be done.
I'm not confused about inlining; I'm confused about how smart JITs can be, and
whether Julia's multiple dispatch should be able to do this already. Obviously
it's not now (as evidenced by this performance issue), but I'm leery of
recommending that anyone add language features if instead it could be handled
within the current framework. Adding features seems like something that should
be done with considerable caution, given how beautiful Julia is currently.
So, could it be handled within the current framework? I haven't looked into
Julia's internals at all, but I'm guessing that the template mechanism is
really just syntactic sugar around multiple dispatch: for every argument, it
peeks at its type, and the combination of types determines which version of
the function it calls. The template mechanism simply provides a syntax for
automatically generating "new" versions of functions without the programmer
having to type each one out.
When you launch a function (which I presume triggers a JIT compile), you
indeed have all the information available, because you have both the function
declaration and all its arguments. So while I don't know anything about JITs,
I'm not sure there is such a thing as "not knowing cmp at compile time." You
could just peek at cmp and see what it is, and make a determination that, for
this function, it should generate a version of _heapify! with it inlined.
Alternatively/equivalently, the "::Function" type could essentially be a
template syntax of its own, and always generate separate compiled versions for
each different argument.
All this is simply me speculating wildly, without knowing a darned thing about
JITs.
Best,
--Tim
Yes, you guys have correctly figured out what is going on here. Short answer, there will be more optimizations for this kind of code. As you noticed, we haven't even really done anything with function types yet. Explicit template parameters is something we've discussed (we could make use of it in our own sort function), but you can emulate it with macros or even a code generator that keeps functions in a hash table indexed by function argument!
> -----Original Message-----
> From: juli...@googlegroups.com [mailto:juli...@googlegroups.com] On
> Behalf Of Stefan Karpinski
> Sent: Donnerstag, 15. März 2012 13:23
> To: juli...@googlegroups.com
> Subject: Re: [julia-dev] Re: Request for feedback: priority_queue
>
> A potential heuristic for specializing on a function argument would be if it is called
> as a function inside of a loop. That would probably get most performance
> intensive cases, including sort and heapify. It wouldn't catch recursive cases
> though.
>
>
> On Thu, Mar 15, 2012 at 1:13 PM, Stefan Karpinski <ste...@karpinski.org>
> wrote:
>
>
> This same performance issue is why I did all sorts of macro shenanigans
> in sort.jl. I think Jeff has a couple of dispatch hoisting optimizations up his sleeve
> that might improve this situation substantially. Other things I can think of that
> would help are:
>
> 1. Allowing explicit specialization on given values (like the values
> of a function).
> 2. Make the comparison function a type parameter instead of a
> field (we currently don't support this, but if we did, it would do what we want for
> specialization).
> 3. Automatically generate specialized versions of methods that
Recursion isn't a special case; that falls out of what the compiler does already.
Best,
--Tim
On Thursday, March 15, 2012 12:13:52 pm Stefan Karpinski wrote:
> This same performance issue is why I did all sorts of macro shenanigans in
> sort.jl. I think Jeff has a couple of dispatch hoisting optimizations up
> his sleeve that might improve this situation substantially. Other things I
> can think of that would help are:
>
> 1. Allowing explicit specialization on given values (like the values of
> a function).
> 2. Make the comparison function a type parameter instead of a field (we
> currently don't support this, but if we did, it would do what we want
> for specialization).
> 3. Automatically generate specialized versions of methods that take
> >>>> http://en.wikipedia.org/wiki/Inline_expansion
> >>>> Basically what happens (and why julia is
> >>>>> heapify!(cmp::Function,Vector{double},int,int)
> >>>>>
> >>>>> and in principle it seems that it should be able to ascertain that
> >>>>> cmp is used
> >>>>> as cmp(double,double) and inline it. Perhaps a "hint", of the form
> >>>>>
> >>>>> heapify!(cmp::Function(::T,::T)::bool,Vector{T},int,int)
Interesting example. I guess it depends on how deeply you can look (when f is
"declared" you don't need to generate any code yet, and you can do it when
you're compiling foo), but I agree that examples like these illustrate some of
the challenge.
I'm getting beyond my expertise, I'll leave this to those who know more.
Best,
--Tim
> > 1. Allowing explicit specialization on given values (like the values
> > of a function).
> > 2. Make the comparison function a type parameter instead of a field
> > (we currently don't support this, but if we did, it would do what we
> > want for specialization).
> > 3. Automatically generate specialized versions of methods that take
> >>>>> http://en.wikipedia.org/wiki/Inline_expansion
> >>>>> Basically what happens (and why julia is
> >>>>>> heapify!(cmp::Function,Vector{double},int,int)
> >>>>>>
> >>>>>> and in principle it seems that it should be able to ascertain that
> >>>>>> cmp is used
> >>>>>> as cmp(double,double) and inline it. Perhaps a "hint", of the form
> >>>>>>
> >>>>>> heapify!(cmp::Function(::T,::T)::bool,Vector{T},int,int)
On Friday, March 16, 2012 11:56:10 am Stefan Karpinski wrote:
>How many different functions do you
> really pass to a higher-order function on anyway?
The one exception I can imagine is numeric parameters to functions; think of
a generic optimization program that gets called to optimize a single function
with different "data" many thousands (or hundreds of thousands) of times in one
program. To be explicit, in Matlab you might have a function
f(parameters_to_be_optimized, fixed_parameters)
and run
fx = @(x) f(x, parameter_set{i});
x = conjgrad(fx, x0);
for many different "i"s. This is the only case I can think of in practice where
you might have a large number of function objects.
> So that's the plan, and I think it's good. But it's non-trivial to
> implement and right now we're focusing on stabilizing things for v1.0, so
> I'm not exactly sure when it's going to get done. In the meantime, I would
> recommend implementing MinHeap and MaxHeap and hoping that it covers all
> the practical cases where someone needs something fast, and a more generic
> Heap that is like the current one, with the caveat that it's a bit slower.
> Later on, we can remove the Min/MaxHeaps and just keep Heap, unless it's
> deemed that it's worthwhile to keep the more specialized heap
> implementations.
Will do.
>
> One important note: the correct comparator to use for sorting is isless,
> not <. The reason for this is that < follows IEEE 754 semantics, as one
> would expect, meaning that NaNs are incomparable and 0.0 == -0.0. The
> incomparability of NaNs, in particular, means that comparison-based sorting
> algorithms actually won't work using < in the presence of NaNs (I have a
> blog post planned about all of this). The isless comparator, on the other
> hand, provides a total order on floats, separating -0.0 from 0.0 and
> putting all NaNs in their own equivalence class, after all other floats. It
> also, e.g., distinguishes 0 from -0.0. One reason we might want to keep
> specialized MinHeap and MaxHeap implementations even after aggressive
> higher-order method specialization is implemented is so that you can jack
> special behaviors into Min/MaxHeap{Float64}: there are tricksy ways to sort
> floats faster than using isless. See jl/sort.jl for an example (warning,
> it's really kind of confusing).
Very good, and thanks for the explanation. What's the proper implementation of
>? (You need this for a max-heap.) ~isless doesn't quite cut it; you will do
needless swaps in the case of ties.
Best,
--Tim
> Hi Stefan,
>
> On Friday, March 16, 2012 11:56:10 am Stefan Karpinski wrote:
>> How many different functions do you
>> really pass to a higher-order function on anyway?
>
> The one exception I can imagine is numeric parameters to functions; think of
> a generic optimization program that gets called to optimize a single function
> with different "data" many thousands (or hundreds of thousands) of times in one
> program. To be explicit, in Matlab you might have a function
>
> f(parameters_to_be_optimized, fixed_parameters)
>
> and run
> fx = @(x) f(x, parameter_set{i});
> x = conjgrad(fx, x0);
>
> for many different "i"s. This is the only case I can think of in practice where
> you might have a large number of function objects.
Those are actually all the same function object but with different closure bindings. That will only trigger a single specialization.
I think that isless(y,x) should do it, no?
On Friday, March 16, 2012 01:54:36 pm Stefan Karpinski wrote:
> Those are actually all the same function object but with different closure
> bindings. That will only trigger a single specialization.
Great. I don't know anything about compilers, so I thought I'd better bring it
up on the off chance it would be important.
> I think that isless(y,x) should do it, no?
Hmm. (Sticks head inside paper bag.)
--Tim