Request for feedback: priority_queue


Tim Holy

Mar 13, 2012, 1:41:32 PM
to juli...@googlegroups.com
Hi,

I've written my first actual Julia code. I deliberately started on a "small"
project (and I might do a few more of those) before getting more ambitious. I
tried to pick something that doesn't yet seem to be part of the library or
example code (I could be wrong), and I decided to implement a simple priority
queue as a heap. I've attached the code; I'd be very grateful for critical
feedback, since that's the only way I'll learn.

I have several questions:
1. If this is useful, once it gets polished should it go in examples/?
2. If so, should I keep mailing patches to the list?
3. How does one handle documentation? You'll notice I put some usage examples
at the top of the file, in an attempt to be a bit helpful.

Long-term, are there plans for any kind of "help system" within Julia? Matlab
displays the initial block of comments inside the function definition when you
type "help myfunc"; it also has a pretty-printed documentation system which is
entirely divorced from those comments in the code, which can be confusing if
the two contradict each other. It might be worth keeping this issue in mind
when developing policy for a future in which I can imagine there being a
graphical IDE, etc.

Best,
--Tim

priority_queue.jl

Stefan Karpinski

Mar 13, 2012, 7:11:03 PM
to juli...@googlegroups.com
This is very cool. I've only gotten a chance to take a cursory look through it, but it looks good to me. Nice use of type parameters for the heap type and everything. I also like bundling the comparator function into the Heap object; it makes total sense to me, but it hadn't occurred to me when I started glancing through.

As to contributing it, yes! Add it to examples and open a pull request, and I'll merge it.

Mateusz Kubica

Mar 13, 2012, 9:41:58 PM
to juli...@googlegroups.com
Hi,
I'm just wondering if it wouldn't be clearer to replace
#   isgreater(x,y) = x > y
#   h = Heap(isgreater,rand(3))
by simply
#   h = Heap((>),rand(3))
Haskell uses such a notation and I find it very compact and readable.

Mateusz


# These functions define a min-heap
# Example usage of the Heap object:
# You can create an empty min-heap this way:
#   h = Heap(Float64)
# This defaults to using isless for comparison.
# A more complex and complete example, yielding a max-heap:
#   isgreater(x,y) = x > y
#   h = Heap(isgreater,rand(3))
#   for i = 1:10
#     push!(h,rand())
#   end
#   length(h)
#   max_val = pop!(h)
#  
# Example using pure vectors (avoids making any copies):
#   z = rand(8)
#   vector2heap!(z)
#   isheap(z)
#   min_val = heap_pop!(z)
#   heap2rsorted!(z)


## Function definitions for operating on vectors ##

function _heapify!{T}(lessthan::Function,x::Vector{T},i::Int,len::Int)
    # Among node i and its two children, find the smallest
    il = 2*i;  # index of left child
    if il > len
        return
    end
    ismallest = lessthan(x[il],x[i]) ? il : i  # smaller of node i and its left child
    if il < len
        ismallest = lessthan(x[il+1],x[ismallest]) ? il+1 : ismallest
    end
    if ismallest != i
        # Put the smallest at i
        tmp = x[ismallest]
        x[ismallest] = x[i]
        x[i] = tmp
        # Recursively re-heapify
        _heapify!(lessthan,x,ismallest,len)
    end
end

function vector2heap!{T}(lessthan::Function, x::Vector{T})
    for i = div(length(x),2):-1:1
        _heapify!(lessthan,x,i,length(x))
    end
end
function vector2heap!{T}(x::Vector{T})
    vector2heap!(isless,x)
end

function isheap{T}(lessthan::Function, x::Vector{T})
    for i = 1:div(length(x),2)
        i2 = 2*i
        # a child strictly smaller than its parent violates the heap property
        # (testing this way, rather than !lessthan(x[i],x[i2]), keeps ties legal)
        if lessthan(x[i2],x[i])
            return false
        end
        if i2 < length(x) && lessthan(x[i2+1],x[i])
            return false
        end
    end
    return true
end
function isheap{T}(x::Vector{T})
    isheap(isless,x)
end

function heap_push!{T}(lessthan::Function, x::Vector{T},item::T)
    # append new element at the bottom
    push(x,item)
    # let the new item percolate up until it has a smaller parent
    i = length(x)
    ip = div(i,2)  # index of parent
    while i > 1 && lessthan(x[i],x[ip])
        # swap i and its parent
        tmp = x[ip]
        x[ip] = x[i]
        x[i] = tmp
        # traverse up the tree
        i = ip
        ip = div(i,2)  # index of parent
    end
end
function heap_push!{T}(x::Vector{T},item::T)
    heap_push!(isless,x,item)
end

function heap_pop!{T}(lessthan::Function, x::Vector{T})
    min_item = x[1]
    x[1] = x[end]
    pop(x)
    _heapify!(lessthan,x,1,length(x))
    return min_item
end
function heap_pop!{T}(x::Vector{T})
    min_item = heap_pop!(isless,x)
    return min_item
end

function heap2rsorted!{T}(lessthan::Function, x::Vector{T})
    for i = length(x):-1:2
        tmp = x[1]
        x[1] = x[i]
        x[i] = tmp
        _heapify!(lessthan,x,1,i-1)
    end
end
function heap2rsorted!{T}(x::Vector{T})
    heap2rsorted!(isless,x)
end

## Heap object ##

type Heap{T}
    lessthan::Function
    data::Array{T,1}

    function Heap(lessthan::Function,x::Array{T,1})
        data = copy(x)
        vector2heap!(lessthan,data)
        new(lessthan,data)
    end
end
Heap{T}(lessthan::Function,x::Vector{T}) = Heap{T}(lessthan,x)
Heap{T}(x::Vector{T}) = Heap{T}(isless,x)
Heap{T}(lessthan::Function,::Type{T}) = Heap{T}(lessthan,zeros(T,0))
Heap{T}(::Type{T}) = Heap{T}(isless,zeros(T,0))

function push!{T}(h::Heap{T},item::T)
    heap_push!(h.lessthan,h.data,item)
end

function pop!{T}(h::Heap{T})
    min_item = heap_pop!(h.lessthan,h.data)
    return min_item
end

function length{T}(h::Heap{T})
    return length(h.data)
end

Stefan Karpinski

Mar 13, 2012, 10:06:45 PM
to juli...@googlegroups.com
Yeah, that works — you don't even need the parens around >, it's just a name (sort of).
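
That is, this should work as-is (sketch):

    h = Heap(>, rand(3))  # a max-heap; > is passed like any other value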

Tim Holy

Mar 13, 2012, 11:44:03 PM
to juli...@googlegroups.com
Thanks for the feedback, folks!

On Tuesday, March 13, 2012 08:41:58 pm Mateusz Kubica wrote:
> Hi,
> I'm just wondering if it wouldn't be clearer to replace
> # isgreater(x,y) = x > y
> # h = Heap(isgreater,rand(3))
> by simply
> # h = Heap((>),rand(3))
> Haskell uses such a notation and I find it very compact and readable.

Changed as suggested (without the parens, as Stefan suggested). I just sent a
pull request; I'm quite new to git, so if you notice problems I'll try to fix them.


I did a little timing:
julia> n = 100000
100000

julia> x = rand(n)
[0.310245, 0.0673555, 0.705961, 0.871099 ... 0.691175, 0.608054, 0.0276982,
0.563395]

julia> tic(); vector2heap!(x); toc()
elapsed time: 0.03545689582824707 seconds
0.03545689582824707

julia> tic(); for i = 1:n; heap_pop!(x); end; toc()
elapsed time: 0.37431907653808594 seconds
0.37431907653808594

There's a Matlab implementation
(http://www.mathworks.com/matlabcentral/fileexchange/34218-heap) whose
internals appear to be very close to what I did for the Julia version.
Timing:
>> n = 100000;
>> tic; h = Heap(rand(1,n)); toc
Elapsed time is 15.433420 seconds.
>> tic; for i = 1:n; mx = heapExtractMax(h); end; toc
Elapsed time is 262.971887 seconds.

So the Julia version absolutely smokes the Matlab version.


However, my Julia code is about 10- to 20-fold slower than a C++ STL-based
implementation, which is attached and can be compiled (on a Linux system)
with:
g++ -Wl,--rpath,/usr/local/lib/ -I/home/tim/src/boost_1_48_0 -L/usr/local/lib/
-lboost_chrono pq_test.cpp -o pq_test

(Obviously you'd have to change some of the paths here)

Here are the timing results, first without optimization:
$ ./pq_test
make_heap took 0.00302657 seconds
pop_heap took 0.0290537 seconds

And with -O2 optimization:
$ ./pq_test
make_heap took 0.00188609 seconds
pop_heap took 0.0165661 seconds

I should say that I didn't copy the C++ STL code in writing my Julia code (I
usually find the standard library hard to read and understand), and obviously
the STL is a highly-optimized library that may use a number of tricks that
could in principle be borrowed to get better performance. But are there any
obvious things I should fix? I presume there is no Julia profiler yet?

Best,
--Tim

pq_test.cpp

Stefan Karpinski

Mar 14, 2012, 12:33:16 AM
to juli...@googlegroups.com
Nice that we're beating Matlab handily. STL is still the gold standard for these kinds of things. I guess we have something to strive for ;-)

Tim Holy

Mar 14, 2012, 7:29:24 PM
to juli...@googlegroups.com
Hi Stefan and others,

This is a long email; the gist is that Julia's performance seems to be
fantastic when you can avoid recursion/lots of function calls, and may suffer
compared to C++ when you cannot. More details below.

On Tuesday, March 13, 2012 11:33:16 pm Stefan Karpinski wrote:
> Nice that we're beating Matlab handily.

Indeed! It's huge.

> STL is still the gold standard for
> these kinds of things. I guess we have something to strive for ;-)

I did more testing, and it turns out that while STL is unquestionably good,
merely translating my Julia code directly into C++ gets to within a factor of
2 of STL's performance for the slower "pop" operation, and nearly the same as
STL for the faster "make_heap" operation. Therefore, when compiled with -O2,
my Julia code is ~10x slower than its direct C++ translation.
Since this is a worse factor than for any of the tasks shown on
http://julialang.org/, I thought it would be worth looking into in more
detail.


First, the numbers:

Without optimization:
$ ./pq_test
std::make_heap took 0.00586058 seconds
std::pop_heap took 0.0380975 seconds
Extreme values: 52150 and 4.29488e+09
vector2heap took 0.00610891 seconds
pop took 0.0850708 seconds
min_val = 52150, max_val = 4.29488e+09
(From the fact that the extreme values are identical, one can infer that my
version of the algorithm, the "vector2heap" and "pop" lines, seems to be
working properly.)

With -O2:
$ ./pq_test
std::make_heap took 0.00268634 seconds
std::pop_heap took 0.0179661 seconds
Extreme values: 52150 and 4.29488e+09
vector2heap took 0.00304951 seconds
pop took 0.037797 seconds
min_val = 52150, max_val = 4.29488e+09

For reference, here are the Julia #s again (just copied from a previous
email):


julia> tic(); vector2heap!(x); toc()
elapsed time: 0.03545689582824707 seconds
0.03545689582824707

julia> tic(); for i = 1:n; heap_pop!(x); end; toc()
elapsed time: 0.37431907653808594 seconds
0.37431907653808594

Of course they fluctuate from one run to the next, but this is very
characteristic of the overall pattern, with the Julia version often taking a
bit more than 10 times as long.


I got curious to see if I could figure out what was going on. One of the things
about heap algorithms is that they are typically heavily recursive, and so I
wondered whether Julia perhaps has a higher function-call overhead than the
C++ code. So I wrote the following:

function sum_numbers{T}(x::Vector{T})
    s::T = 0
    for i = 1:length(x)
        s += x[i]
    end
    return s
end

This ran approximately twice as fast as its C++ counterpart (!):
julia> const c = rand(1000000);
julia> s = 0; tic(); for j = 1:100; s += sum_numbers(c); end; toc(); s
elapsed time: 0.18297600746154785 seconds
4.997657313178386e7

C++ version:
$ ./sum_numbers
summation took 0.364052 seconds, with result 2.1476e+09
(That was even with -O2)

I then tested this function:
function recursive(n::Int)
    if n > 0
        recursive(n-1)
    end
end

julia> tic(); for i = 1:1000; recursive(10000); end; toc()
elapsed time: 0.2863640785217285 seconds

For C++, with -O2 the recursive calls get optimized out, so I just tested the
non-optimized version:
$ ./recursive
Recursion took 0.0602431 seconds

which is about 5-fold faster. If this could be optimized, I wouldn't be
surprised if the factor rose to 10-fold.

Hence, this leads me to suspect that Julia's poorer performance on heaps is
largely due to the recursive nature of the algorithm I implemented. Perhaps
the reason could be that Julia may have a substantially larger function-call
overhead than C++? I may try to rewrite my heap algorithm removing the
recursion and see what happens.
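
For reference, a de-recursed version of _heapify! might look something like
this (untested sketch; the same logic, with the tail call turned into a loop):

function _heapify!{T}(lessthan::Function,x::Vector{T},i::Int,len::Int)
    while true
        il = 2*i  # index of left child
        if il > len
            return
        end
        # find the smallest among node i and its children
        ismallest = lessthan(x[il],x[i]) ? il : i
        if il < len && lessthan(x[il+1],x[ismallest])
            ismallest = il+1
        end
        if ismallest == i
            return
        end
        # put the smallest at i, then continue down into that subtree
        x[i], x[ismallest] = x[ismallest], x[i]
        i = ismallest
    end
end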

Hope this helps in some way.

Best,
--Tim

Tim Holy

Mar 14, 2012, 7:43:10 PM
to juli...@googlegroups.com
Oops, meant to attach the C++ files.

Compiling them can be a bit annoying depending on whether you have Boost
installed in a standard location or not; if not, you'll need something
along the lines of this:

g++ -Wl,--rpath,/usr/local/lib/ -I/home/tim/src/boost_1_48_0 -L/usr/local/lib/
-lboost_chrono recursive.cpp -o recursive

Or you could change the timers over to being non-Boost.

Best,
--Tim

pq_test.cpp
priority_queue.cpp
recursive.cpp
sum_numbers.cpp

Jeff Bezanson

Mar 14, 2012, 7:43:20 PM
to juli...@googlegroups.com
Timing stuff is always fun ;) Maybe we should add this to our standard
benchmarks, so we have something more complex in the mix, and where we
have more ground to cover relative to C++. The annoying part is
porting to all the languages, but we already have it for 2 of them.

The fib benchmark mostly covers function calls, and there we are 2x
slower than C. Yes, we have a bit of call overhead relative to C.
There are optimizations left to do there. I wouldn't go so far as to
say people should avoid recursive code in Julia; for example, the
quicksort benchmark is recursive and within a factor of 2 of C, which
to me is pretty good.

Tim Holy

Mar 15, 2012, 6:25:25 AM
to juli...@googlegroups.com
Hi again,

I removed the recursion, and this saved about 20% of the time. Nice and
worthwhile, but nothing very dramatic; it was still ~16-fold slower than using
the STL.

Then, I went ahead and wrote a separate version of each function for when the
comparison function is "<". That is to say, the former code looked like this:

function dosomething(cmp::Function, argA, ...)
    # run some tasks
end
function dosomething(argA, ...)
    dosomething(<, argA, ...)
end

and the new code like this:

function dosomething(cmp::Function, argA, ...)
    # run some tasks
end
function dosomething(argA, ...)
    # a copy of the code above, with cmp(a,b) everywhere replaced by (a < b)
end
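
Concretely, the specialized copy of, e.g., heap_push! would look something
like this (illustrative sketch):

function heap_push!{T}(x::Vector{T},item::T)
    # append new element at the bottom
    push(x,item)
    # let the new item percolate up until it has a smaller parent
    i = length(x)
    ip = div(i,2)  # index of parent
    while i > 1 && x[i] < x[ip]  # cmp(a,b) replaced by (a < b)
        x[i], x[ip] = x[ip], x[i]  # swap i and its parent
        i = ip
        ip = div(i,2)
    end
end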

Using the version without the cmp resulted in more than a five-fold speedup.
We're now within a factor of 2.5 of the STL, even when compiled with -O2. I can
live with a factor of 3; I'll confess that a factor of 16 left me with a few
doubts about whether to go to the effort to switch to Julia, or just stick to
writing MEX files all the time... And I haven't even yet studied the STL
implementation to see if I can extract some additional performance tips.

Before I commit this, let me note that the STL does not pay nearly the same
high price for a generic comparison function. Is it by any chance due to the
compiler having trouble guessing the return type of the comparison function?
Is there a way to declare that return type? Is there any equivalent of
"inline" (if this doesn't already happen automatically)?

If most/all of the above are answered with "no," then I wonder whether a
better design would be to have two implementations, one of a min-heap and
another of a max-heap, and then get rid of the idea of a generic comparison
function altogether. Presumably the generic case could be handled by
overloading the < operator using the type system anyway.

Best,
--Tim

Jonas Rauch

Mar 15, 2012, 6:31:02 AM
to juli...@googlegroups.com
I think I have found where you lose performance vs. C++:
I removed cmp everywhere and just used <, and did timings then. The result is about 10x faster than the original code, which is probably in a similar ballpark to C++ (I did not test on my machine). I guess in this case every call to "<" can be inlined, while otherwise it cannot. So what would help here is to have the cmp function as a template parameter rather than a function argument (as it is in Boost?). This way the compiler can inline almost all sensible comparison functions, even for composite types, for example stuff like cmp(x, y) = x.a < y.a.

So, to the developers: how hard would it be to have a function as a template parameter?

As a quick and dirty solution you could just write everything as a macro that depends on the comparison function. Then one would have to do something like
Heap_mycmp = @heap mycmp
or similar.

Here's my code:
https://github.com/JonasRauch/julia/tree/heap

Also, I would suggest doing more iterations for tests that take less than 1 second. At least on my machine, due to CPU power management, etc., the variance in timings is extremely high otherwise.
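
For example, heapify 100 fresh copies and divide the time by 100, something like:

julia> x = rand(100000); tic(); for k = 1:100; vector2heap!(copy(x)); end; toc()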

Jonas Rauch

Mar 15, 2012, 6:35:06 AM
to juli...@googlegroups.com
Tim, you were quicker :)
I strongly believe that C++ has the advantage because the compiler knows the comparison function (not only the return type, which should be bool anyway) at compile time, while the Julia code knows nothing about cmp at compile time.

Tim Holy

Mar 15, 2012, 7:31:27 AM
to juli...@googlegroups.com
On Thursday, March 15, 2012 05:35:06 am Jonas Rauch wrote:
> Tim, you were quicker :)

Not by much! I declare an effective tie.

Best,
--Tim

Tim Holy

Mar 15, 2012, 8:52:18 AM
to juli...@googlegroups.com
Hi Jonas,

On Thursday, March 15, 2012 05:31:02 am Jonas Rauch wrote:
> I guess in this case every
> call to "<" can be inlined, while otherwise it cannot. So what would help
> here is to have the cmp function as a template parameter rather than a
> function argument (as it is in Boost?). This way the compiler can inline
> almost all sensible comparison functions, even for composite types, for
> example stuff like cmp(x, y) = x.a < y.a.
>
> So, to the developers: How hard would it be to have a function as template
> parameter?

I've been thinking about this a bit more. I don't know much about compilers,
but I'd guess that a sufficiently smart compiler could infer the type. The
compiler has to make a version for

    heapify!(cmp::Function,Vector{double},int,int)

and in principle it seems that it should be able to ascertain that cmp is used
as cmp(double,double) and inline it. Perhaps a "hint", of the form

    heapify!(cmp::Function(::T,::T)::bool,Vector{T},int,int)

would make its job easier.

As an alternative to function template parameters, it seems like it might be
better to do away with the version with a generic cmp input and simply require
that type T has a < operator. Not sure that allows inlining in general, but it
certainly seems to work for the case of Float64, etc.

I suspect that these issues could all be "temporary" consequences of Julia
being a young (but already very impressive!) project.

Best,
--Tim

Jonas Rauch

Mar 15, 2012, 9:29:12 AM
to juli...@googlegroups.com
Hi Tim.

The thing is, the type of "cmp" is always "Function". The return type of "cmp" is always Bool. And yes, the argument types of "cmp" are always T, and the compiler can figure this out, as everything you stick in during your algorithm has type T. But all of that is not what inlining is about:
http://en.wikipedia.org/wiki/Inline_expansion
Basically what happens (and why Julia is often so fast) is that if, for example,
cmp(x::Float64, y::Float64) = <(x,y)
then if you have a function
F(...)=
   ...
   big loop
      ...
      cmp(x,y)
      ...
   end big loop
   ...
end
where x and y in the inner call to cmp are Float64, then the compiler will, during compilation, actually replace the call to cmp with the call to <, i.e.
F(...)=
   ...
   big loop
      ...
      <(x,y)
      ...
   end big loop
   ...
end
Now, since < for Floats is an elementary operation, this would actually just become one operation directly on the CPU registers.
Now, if cmp is not known at compile time, this kind of optimization cannot be done. Instead, regardless of what cmp actually is, a function call is made. Since a template parameter is known at compile time, it could lead to this kind of optimization.

Regards
Jonas

Patrick O'Leary

Mar 15, 2012, 9:52:46 AM
to juli...@googlegroups.com
I wonder if the compiler could figure this out when JITting at call time if there were some way of const-marking function parameters? Of course then you have the trade-off of whether it's worth it to generate specialized code which inlines particular const parameters at a given call site.

Though along those lines, what happens if you define:

dosomething(...) = dosomething(<, ...)

Does that indicate to Julia that perhaps it should compile the RHS and thus create an inlining opportunity?

Jonas Rauch

Mar 15, 2012, 10:03:38 AM
to juli...@googlegroups.com
Apparently not. This is what the heap example looked like at first, and performance was much worse.

Tim Holy

Mar 15, 2012, 11:18:32 AM
to juli...@googlegroups.com
Hi Jonas,

On Thursday, March 15, 2012 08:29:12 am Jonas Rauch wrote:
> Now, if cmp is not known at compile time, this kind of optimization cannot
> be done.

I'm not confused about inlining; I'm confused about how smart JITs can be, and
whether Julia's multiple dispatch should be able to do this already. Obviously
it doesn't do so now (as evidenced by this performance issue), but I'm leery of
recommending that anyone add language features if instead it could be handled
within the current framework. Adding features seems like something that should
be done with considerable caution, given how beautiful Julia is currently.

So, could it be handled within the current framework? I haven't looked into
Julia's internals at all, but I'm guessing that the template mechanism is
really just syntactic sugar around multiple dispatch: for every argument, it
peeks at its type, and the combination of types determines which version of
the function it calls. The template mechanism simply provides a syntax for
automatically generating "new" versions of functions without the programmer
having to type each one out.

When you launch a function (which I presume triggers a JIT compile), you
indeed have all the information available, because you have both the function
declaration and all its arguments. So while I don't know anything about JITs,
I'm not sure there is such a thing as "not knowing cmp at compile time." You
could just peek at cmp and see what it is, and make a determination that, for
this function, it should generate a version of _heapify! with it inlined.

Alternatively/equivalently, the "::Function" type could essentially be a
template syntax of its own, and always generate separate compiled versions for
each different argument.

All this is simply me speculating wildly, without knowing a darned thing about
JITs.

Best,
--Tim

Jeff Bezanson

Mar 15, 2012, 11:26:34 AM
to juli...@googlegroups.com

Yes, you guys have correctly figured out what is going on here. Short answer, there will be more optimizations for this kind of code. As you noticed, we haven't even really done anything with function types yet. Explicit template parameters is something we've discussed (we could make use of it in our own sort function), but you can emulate it with macros or even a code generator that keeps functions in a hash table indexed by function argument!

Stefan Karpinski

Mar 15, 2012, 1:13:52 PM
to juli...@googlegroups.com
This same performance issue is why I did all sorts of macro shenanigans in sort.jl. I think Jeff has a couple of dispatch hoisting optimizations up his sleeve that might improve this situation substantially. Other things I can think of that would help are:
  1. Allowing explicit specialization on given values (like the values of a function).
  2. Make the comparison function a type parameter instead of a field (we currently don't support this, but if we did, it would do what we want for specialization).
  3. Automatically generate specialized versions of methods that take function arguments for each function value they are called with.
I'm not sure which of these is most radical. The third option would help the most cases the most transparently, but could lead to a lot of compiled code bloat. This might need a heuristic to decide when to do it or not. The second option is kind of cool since it would have an effect much like case 3 but give more explicit control over it. The first option is completely explicit and could be very handy, but might be annoying to use.

Stefan Karpinski

Mar 15, 2012, 1:23:15 PM
to juli...@googlegroups.com
A potential heuristic for specializing on a function argument would be if it is called as a function inside of a loop. That would probably get most performance-intensive cases, including sort and heapify. It wouldn't catch recursive cases though.

Keno Fischer

Mar 15, 2012, 1:38:37 PM
to juli...@googlegroups.com
I've been thinking about just those recursive cases. We could have Julia detect if a function is called recursively with the same argument types and if that's the case just skip the whole jl_apply-logic and emit a call directly to the function itself as we know that it must have been compiled (because it's calling itself).

> -----Original Message-----
> From: juli...@googlegroups.com [mailto:juli...@googlegroups.com] On
> Behalf Of Stefan Karpinski
> Sent: Thursday, 15 March 2012 13:23
> To: juli...@googlegroups.com
> Subject: Re: [julia-dev] Re: Request for feedback: priority_queue
>
> A potential heuristic for specializing on a function argument would be if
> it is called as a function inside of a loop. That would probably get most
> performance-intensive cases, including sort and heapify. It wouldn't catch
> recursive cases though.

Jeff Bezanson

Mar 15, 2012, 1:48:02 PM
to juli...@googlegroups.com

Recursion isn't a special case; that falls out of what the compiler does already.

Stefan Karpinski

Mar 15, 2012, 1:57:21 PM
to juli...@googlegroups.com
To expand that somewhat compact answer, the compiler already skips all jl_apply logic if it can figure out exactly what the types of a function's arguments are, regardless of whether the call is recursive or not.

Jonas Rauch

Mar 15, 2012, 5:53:00 PM
to juli...@googlegroups.com
Tim, sorry, misunderstanding on my part. In my mind, the idea of dispatch and specialization only works on types. Specialization on values never occurred to me, so that's why I didn't get what you were saying. It's an interesting idea, though. I think the problem with this (Points 1 and 3 on Stefan's list) is: what do you do with something like
f(g::Function, x) = g(x)
function foo(bar)
  if bar==1
    h(x) = x
  else
    h(x) = -x
  end
  f(h, 10)
end
Now, during compile time, i.e. when you try to call foo(1), the compiler would compile everything that is needed to evaluate this expression. So when compiling f, the actual value of g is not known; it will only be constructed later. In this case it is only one of two choices, but in general h could depend arbitrarily on variables that are computed in foo, so I think there is no way of using compiler magic here. One could, of course, compile a generic version and in addition compile a specialized one, if the argument is const (or by some other logic).

To me, Point 2 seems most natural. This would not only lead to better performance, but also (in this case) make clear that a heap with comparison function "<" is something different from a heap that orders differently.

Tim Holy

Mar 15, 2012, 9:11:55 PM
to juli...@googlegroups.com
Out of curiosity, what would you recommend I do with the current heap code:
1. Leave it as is, trusting that these optimizations you mention may come soon
2. Rewrite with "macro shenanigans" (I am willing, but the cost presumably is
less readable code)
3. Implement separate minheap/maxheap functions and trust that this will cover
all practical cases

Best,
--Tim

On Thursday, March 15, 2012 12:13:52 pm Stefan Karpinski wrote:
> This same performance issue is why I did all sorts of macro shenanigans in
> sort.jl. I think Jeff has a couple of dispatch hoisting optimizations up
> his sleeve that might improve this situation substantially. Other things I
> can think of that would help are:
>   1. Allowing explicit specialization on given values (like the values of
>      a function).
>   2. Make the comparison function a type parameter instead of a field (we
>      currently don't support this, but if we did, it would do what we want
>      for specialization).
>   3. Automatically generate specialized versions of methods that take
>      function arguments.

Tim Holy

Mar 15, 2012, 9:24:34 PM
to juli...@googlegroups.com
Hi Jonas,

Interesting example. I guess it depends on how deeply you can look (when f is
"declared" you don't need to generate any code yet, and you can do it when
you're compiling foo), but I agree that examples like these illustrate some of
the challenge.

I'm getting beyond my expertise; I'll leave this to those who know more.

Best,
--Tim


Stefan Karpinski

Mar 16, 2012, 12:56:10 PM
to juli...@googlegroups.com
After some discussion (I've been hanging out in Boston for a few days), I convinced Jeff that the right way to solve this problem is to specialize every higher-order method call on its function-valued arguments. This means that when you call sort!(<,v) it will be just as fast as if you had written out the definition of sort! with the < comparator. This completely eliminates any need for "macro shenanigans" like the ones I did in sort.jl. There will be no penalty for using generically written higher-order functions rather than writing something more specific.

This approach fits with our current philosophy of highly aggressive specialization. Currently, we JIT specialized code for every combination of actual argument types that a function encounters (up to a point, determined by some pretty simple heuristics). All the possible combinations of types of argument values that one could end up passing to a particular function are, in principle, unbounded, but in practice, not that diverse. The heuristics serve two purposes: reducing the amount of memory overhead for compiled code (about 12%), and much more importantly, protecting us from the pathological worst case where some function gets called with an unbounded number of argument-type combinations.

On the other hand, all the function arguments one could possibly pass are actually bounded by the source code (counting all instances of closures as only one), and in practice, I suspect very, very limited. How many different functions do you really pass to a higher-order function, anyway? Higher-order functions are mostly a way of composing functionality. (The only way I can think of to possibly generate an unbounded number of functions that a method might have to be specialized on would be to keep evaling new functions and passing them to some higher-order function. This behavior seems sufficiently pathological to ignore.)

So that's the plan, and I think it's good. But it's non-trivial to implement and right now we're focusing on stabilizing things for v1.0, so I'm not exactly sure when it's going to get done. In the meantime, I would recommend implementing MinHeap and MaxHeap and hoping that it covers all the practical cases where someone needs something fast, and a more generic Heap that is like the current one, with the caveat that it's a bit slower. Later on, we can remove the Min/MaxHeaps and just keep Heap, unless it's deemed that it's worthwhile to keep the more specialized heap implementations.

One important note: the correct comparator to use for sorting is isless, not <. The reason for this is that < follows IEEE 754 semantics, as one would expect, meaning that NaNs are incomparable and 0.0 == -0.0. The incomparability of NaNs, in particular, means that comparison-based sorting algorithms actually won't work using < in the presence of NaNs (I have a blog post planned about all of this). The isless comparator, on the other hand, provides a total order on floats, separating -0.0 from 0.0 and putting all NaNs in their own equivalence class, after all other floats. It also, e.g., distinguishes 0 from -0.0. One reason we might want to keep specialized MinHeap and MaxHeap implementations even after aggressive higher-order method specialization is implemented is so that you can jack special behaviors into Min/MaxHeap{Float64}: there are tricksy ways to sort floats faster than using isless. See jl/sort.jl for an example (warning, it's really kind of confusing).
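
To make the difference concrete, an illustrative REPL sketch (exact printing may differ):

julia> 0.0 == -0.0, -0.0 < 0.0
(true,false)

julia> isless(-0.0, 0.0)  # isless separates the two zeros
true

julia> NaN < 1.0, 1.0 < NaN  # NaN is incomparable under <
(false,false)

julia> isless(1.0, NaN), isless(NaN, 1.0)  # isless puts NaN after everything else
(true,false)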

Tim Holy

Mar 16, 2012, 2:18:20 PM
to juli...@googlegroups.com
Hi Stefan,

On Friday, March 16, 2012 11:56:10 am Stefan Karpinski wrote:
> How many different functions do you
> really pass to a higher-order function, anyway?

The one exception I can imagine is numeric parameters to functions; think of
a generic optimization program that gets called to optimize a single function
with different "data" many thousands (or hundreds of thousands) of times in one
program. To be explicit, in Matlab you might have a function

f(parameters_to_be_optimized, fixed_parameters)

and run
fx = @(x) f(x, parameter_set{i});
x = conjgrad(fx, x0);

for many different "i"s. This is the only case I can think of in practice where
you might have a large number of function objects.
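
In Julia the equivalent would be an anonymous function closing over the loop
variable, something like this (sketch, reusing the hypothetical f, conjgrad,
and parameter_set names from the Matlab version):

fx = x -> f(x, parameter_set[i])
x = conjgrad(fx, x0)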


> So that's the plan, and I think it's good. But it's non-trivial to
> implement and right now we're focusing on stabilizing things for v1.0, so
> I'm not exactly sure when it's going to get done. In the meantime, I would
> recommend implementing MinHeap and MaxHeap and hoping that it covers all
> the practical cases where someone needs something fast, and a more generic
> Heap that is like the current one, with the caveat that it's a bit slower.
> Later on, we can remove the Min/MaxHeaps and just keep Heap, unless it's
> deemed that it's worthwhile to keep the more specialized heap
> implementations.

Will do.

>
> One important note: the correct comparator to use for sorting is isless,
> not <. The reason for this is that < follows IEEE 754 semantics, as one
> would expect, meaning that NaNs are incomparable and 0.0 == -0.0. The
> incomparability of NaNs, in particular, means that comparison-based sorting
> algorithms actually won't work using < in the presence of NaNs (I have a
> blog post planned about all of this). The isless comparator, on the other
> hand, provides a total order on floats, separating -0.0 from 0.0 and
> putting all NaNs in their own equivalence class, after all other floats. It
> also, e.g., distinguishes 0 from -0.0. One reason we might want to keep
> specialized MinHeap and MaxHeap implementations even after aggressive
> higher-order method specialization is implemented is so that you can jack
> special behaviors into Min/MaxHeap{Float64}: there are tricksy ways to sort
> floats faster than using isless. See jl/sort.jl for an example (warning,
> it's really kind of confusing).

Very good, and thanks for the explanation. What's the proper implementation of
>? (You need this for a max-heap.) ~isless doesn't quite cut it; you will do
needless swaps in the case of ties.

Best,
--Tim

Stefan Karpinski

Mar 16, 2012, 2:54:36 PM
to juli...@googlegroups.com
On Mar 16, 2012, at 2:18 PM, Tim Holy <tim....@gmail.com> wrote:

> Hi Stefan,
>
> On Friday, March 16, 2012 11:56:10 am Stefan Karpinski wrote:
>> How many different functions do you
>> really pass to a higher-order function, anyway?
>
> The one exception I can imagine is numeric parameters to functions; think of
> a generic optimization program that gets called to optimize a single function
> with different "data" many thousands (or hundreds of thousands) of times in one
> program. To be explicit, in Matlab you might have a function
>
> f(parameters_to_be_optimized, fixed_parameters)
>
> and run
> fx = @(x) f(x, parameter_set{i});
> x = conjgrad(fx, x0);
>
> for many different "i"s. This is the only case I can think of in practice where
> you might have a large number of function objects.

Those are actually all the same function object but with different closure bindings. That will only trigger a single specialization.

I think that isless(y,x) should do it, no?
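
That is, something like:

    isgreater(x, y) = isless(y, x)  # reversed arguments: still a strict comparison, so ties don't swap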

Tim Holy

Mar 16, 2012, 3:26:24 PM
to juli...@googlegroups.com
Hi Stefan,

On Friday, March 16, 2012 01:54:36 pm Stefan Karpinski wrote:
> Those are actually all the same function object but with different closure
> bindings. That will only trigger a single specialization.

Great. I don't know anything about compilers, so I thought I'd better bring it
up on the off chance it would be important.

> I think that isless(y,x) should do it, no?

Hmm. (Sticks head inside paper bag.)

--Tim

Jonas Rauch

Mar 17, 2012, 7:48:52 PM
to juli...@googlegroups.com
This idea of specialization sounds great. It should benefit many other applications, e.g. the ODE solvers. It also ties in well with the discussion about an optimization framework, which would heavily use callbacks. In general, this concept would encourage development in pure Julia even more, because a lot of the performance gains from this are lost if function calls are made from an external library, if I am not mistaken.

Stefan Karpinski

Mar 21, 2012, 11:29:05 AM
to juli...@googlegroups.com
On Fri, Mar 16, 2012 at 12:56 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
One important note: the correct comparator to use for sorting is isless, not <. The reason for this is that < follows IEEE 754 semantics, as one would expect, meaning that NaNs are incomparable and 0.0 == -0.0. The incomparability of NaNs, in particular, means that comparison-based sorting algorithms actually won't work using < in the presence of NaNs (I have a blog post planned about all of this). The isless comparator, on the other hand, provides a total order on floats, separating -0.0 from 0.0 and putting all NaNs in their own equivalence class, after all other floats. It also, e.g., distinguishes 0 from -0.0. One reason we might want to keep specialized MinHeap and MaxHeap implementations even after aggressive higher-order method specialization is implemented is so that you can jack special behaviors into Min/MaxHeap{Float64}: there are tricksy ways to sort floats faster than using isless. See jl/sort.jl for an example (warning, it's really kind of confusing).

I just remembered another important reason for isless being different from <. In fact, it's the one that originally prompted the separation: A < B where A or B is an array creates an array of booleans. So if you try to sort an array of arrays, the sorting code blows up. On the other hand, isless is defined to compare arrays by lexicographical ordering, so it puts all arrays in a total ordering and can be used for sorting arrays.
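
For example (illustrative sketch; exact printing may differ):

    julia> [1,2] < [1,3]          # elementwise comparison, yields an array
    [false,true]

    julia> isless([1,2], [1,3])   # lexicographic, yields a single Bool
    true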