Help on performance issues

464 views
Skip to first unread message

Patrick Useldinger

unread,
Oct 18, 2015, 9:49:45 PM10/18/15
to julia-users
Hi everybody,

I am currently benchmarking languages to write a kakuro solver in. I am not really familiar with Julia but ported the small algorithm I had written in Lisp and PyPy to Julia nevertheless. Unfortunately Julia's performance is really bad and I'm reaching out to the community to find out whether it's due to my coding or not.

Here's the code in Julia:

function decompose(val, num)
  function sub(num, tot, res, pos, allres)
    if val==tot && num==0
      union(allres, res)
    elseif tot<val && num>0 && !isempty(pos)
      sub(num-1, tot+pos[1], union(copy(res),pos[1]), pos[2:end], sub(num, tot, res, pos[2:end], allres))
    else
      allres
    end
  end
  sub(num, 0, Int16[], 1:(max(num, 10))-1, Array[])
end

for i in 1:45
  for j in 2:20
    res = decompose(i, j)
  end
end

It takes roughly 13.4 seconds on my computer.

The equivalent Python code

def decompose(val, num):
    def sub(num, tot, res, pos, allres):
        if tot == val and num == 0:
            return allres+res
        elif tot < val and num != 0 and pos:
            return sub(num-1, tot+pos[0], res+[pos[0]], pos[1:], sub(num, tot, res, pos[1:], allres))
        else:
            return allres
    return sub(num, 0, [], range(1, max(num, 10)), [])


if __name__ == "__main__":
    for i in range(1, 46):
        for j in range(2,21):
            res = decompose(i,j)
  
is 1.4 seconds and the Lisp

defun decompose (val num)
  (labels ((sub (num tot res pos allres)
             (cond
              ((and (= tot val) (zerop num)) (cons (reverse res) allres))
              ((and (< tot val) (not (zerop num)) (not (null pos)))
               (sub (1- num) (+ tot (car pos)) (cons (car pos) res) (cdr pos)
                    (sub num tot res (cdr pos) allres)))
              (t allres))))
    (reverse (sub num 0 nil (loop for i from 1 below (max num 10) collect i) nil))))

(time
 (loop for val from 1 below 46  do
       (loop for num from 2 below 21 do
             (let ((res (decompose val num)))))))

is 0.5 seconds.

Any hints to make the Julia code perform better?

Regards,
-Patrick

Daniel Jones

unread,
Oct 18, 2015, 10:34:54 PM10/18/15
to julia-users
Hi Patrick,

The biggest difference here is how `res` and `allres` are updated. In python version you say 'res+[pos[0]]' which creates a new list with the value appended, but in the Julia version you use `union(copy(res),pos[1])`. The union function will hash all the values in res and pos[1] to ensure there are no duplicates, so it's considerably slower that what's happening in the python or lisp version. If I change those `union` calls to `push!`, it's more closely resembles what the python version is doing, and reduces the run time to about 0.5 seconds for me.

You could also try using a linked list like the one in the DataStructures package to avoid copying `res` whenever you add a value to it, which would bring it inline with the lisp version, and could be faster.

Patrick Useldinger

unread,
Oct 19, 2015, 3:43:31 AM10/19/15
to julia-users
Hello Daniel, 
you're spot on. Using push! gets me down to 1.1 seconds.
Will look into the linked list later.
Sorry for the noise ;-)
-Patrick

Jonathan Malmaud

unread,
Oct 19, 2015, 4:13:44 AM10/19/15
to julia-users
Additionally, using closures/nested functions can often negatively impact performance due to an implementation detail of Julia that is actively being resolved.

Try moving "sub" to the top level and having it take "val" as an explicit parameter - I'd be curious how that, plus the switch to push!, ultimately impacts running time. If you could post the final optimized program for posterity's sake, that would be great! 

Jason Merrill

unread,
Oct 19, 2015, 6:51:49 AM10/19/15
to julia-users
If I'm reading correctly, it looks like decompose is intended to give all partitions of val into num summands.

Julia actually has this function as part of Base:

julia> collect(Vector{Int}, partitions(10, 3))
8-element Array{Array{Int64,1},1}:
 
[8,1,1]
 
[7,2,1]
 
[6,3,1]
 
[5,4,1]
 
[6,2,2]
 
[5,3,2]
 
[4,4,2]
 
[4,3,3]

partitions actually returns an iterator, so if you can process the partitions 1-by-1, you don't actually have to store them all at once.

You can see how it's implemented here:

https://github.com/JuliaLang/julia/blob/8396c9f5c9a3483e35f5e1f9ef63d4c6b3f45b04/base/combinatorics.jl#L307-L369

Patrick Useldinger

unread,
Oct 19, 2015, 7:35:48 AM10/19/15
to julia-users
Hi
this gets execution time down to 0.9 on my machine:

function sub(val, num, tot, res, pos, allres)
  if val==tot && num==0
    push!(allres, res)
  elseif tot<val && num>0 && !isempty(pos)
    sub(val, num-1, tot+pos[1], push!(copy(res),pos[1]), pos[2:end], sub(val, num, tot, res, pos[2:end], allres))
  else
    allres
  end
end

function decompose(val, num)
  sub(val, num, 0, Int16[], 1:(max(num, 10))-1, Array[])
end

Regards,
-Patrick

Patrick Useldinger

unread,
Oct 19, 2015, 7:39:03 AM10/19/15
to julia-users
Hello
true but no summand may appear twice, and only numbers 1 to 9 may be used. For example, (10, 3) yields

Array[Int16[2,3,5],Int16[1,4,5],Int16[1,3,6],Int16[1,2,7]]

Regards,
-Patrick

Jason Merrill

unread,
Oct 19, 2015, 9:27:05 AM10/19/15
to julia-users
On Monday, October 19, 2015 at 7:39:03 AM UTC-4, Patrick Useldinger wrote:
Hello
true but no summand may appear twice, and only numbers 1 to 9 may be used. For example, (10, 3) yields

Array[Int16[2,3,5],Int16[1,4,5],Int16[1,3,6],Int16[1,2,7]]

Regards,
-Patrick

Ah, okay, I missed those constraints. I think you could probably adapt the partitions code to generate only partitions that satisfy that constraint, but an alternative is just to filter combinations of 1:9 according to whether they have the correct sum.

julia> function decompose(val, num)
           
out = Vector{Int}[]
           
for c in combinations(1:9, num)
               
if sum(c) == val
                   push
!(out, c)
               
end
           
end
           
out
       
end

julia
> function benchmark()
         
local maxres = Int[]

         
for i in 1:45
           
for j in 2:20
             res
= decompose(i, j)

             
if length(res) > length(maxres)
               maxres
= res
             
end
           
end
         
end
         maxres
       
end

julia
> benchmark(); @time benchmark()
 
0.002602 seconds (70.52 k allocations: 5.409 MB)
12-element Array{Array{Int64,1},1}:
 
[1,2,8,9]
 
[1,3,7,9]
 
[1,4,6,9]
 
[1,4,7,8]
 
[1,5,6,8]
 
[2,3,6,9]
 
[2,3,7,8]
 
[2,4,5,9]
 
[2,4,6,8]
 
[2,5,6,7]
 
[3,4,5,8]
 
[3,4,6,7]

3ms is pretty snappy compared to the recursive solutions.

Patrick Useldinger

unread,
Oct 19, 2015, 10:00:12 AM10/19/15
to julia-users
Jason,
I cannot look into this right now but for sure the results are not correct. 
BTW the first version I posted does not give correct results either, but the last one does.
Thanks very much, I think my next step is to get more acquainted with Julia ;-)
-Patrick

Jason Merrill

unread,
Oct 19, 2015, 10:12:16 AM10/19/15
to julia-users
The combinations solution gives the same answers as the last solution you posted, in reverse order.

julia> function decompose_jm(val, num)

         
out = Vector{Int}[]
         
for c in combinations(1:9, num)
             
if sum(c) == val
                  push
!(out, c)
             
end
           
end
           
out
       
end

decompose_jm
(generic function with 1 method)

julia
> function sub(val, num, tot, res, pos, allres)
         
if val==tot && num==0
           push
!(allres, res)

         elseif tot
<val && num>0 && !isempty(pos)

           
sub(val, num-1, tot+pos[1], push!(copy(res),pos[1]), pos[2:end], sub(val, num, tot, res, pos[2:end], allres))
         
else
           allres
         
end
       
end
sub (generic function with 1 method)

julia
> function decompose_pu(val, num)
         
sub(val, num, 0, Int16[], 1:(max(num, 10))-1, Array[])
       
end
decompose_pu
(generic function with 1 method)


julia
> for i in 1:45
         
for j in 1:20
           
if decompose_jm(i, j) != reverse(decompose_pu(i, j))
             
print(i, j)
           
end
         
end
       
end; print("success")
success

Jason Merrill

unread,
Oct 21, 2015, 7:08:35 PM10/21/15
to julia-users
I got interested in trying to optimize this problem even further. Here are the results:

https://gist.github.com/jwmerrill/5b364d1887f40f889142

I was able to get the benchmark down to a few microseconds (or ~100 microseconds if you count the time to build a look up table). Either way, it's a pretty good improvement over 1+ seconds :-)

The main trick is to represent a set of digits 1-9 as a binary integer. There are only 2^9=512 such sets, so you can pack any of them into an Int16. Then you can precompute the sum of each set and store those in a look up table, so that finding the ways to decompose a given number is just a table lookup.

I think this is a pretty nice example of how Julia's dispatch system let you have complex views and operations over a very simple data structure (in this case, a single integer), with essentially 0 overhead.

Stefan Karpinski

unread,
Oct 21, 2015, 7:13:24 PM10/21/15
to Julia Users
That's a very nice implementation. Great example of how making custom types can give you a really lovely combination of usability and performance. I may use this in some talks if you don't mind!

Jason Merrill

unread,
Oct 21, 2015, 7:26:40 PM10/21/15
to Julia Users
Thanks! Definitely feel free to use this wherever you like.

Stefan Karpinski

unread,
Oct 21, 2015, 7:28:01 PM10/21/15
to Julia Users
Excellent. I will – with attribution, of course!

Jason Merrill

unread,
Nov 3, 2015, 12:43:24 AM11/3/15
to julia-users
I decided to write a blog post based on the Kakuro puzzle solver problem from this thread:

http://squishythinking.com/2015/11/02/optimizing-kakuro-in-julia/

Patrick Useldinger

unread,
Nov 3, 2015, 2:49:11 AM11/3/15
to julia-users
Jason,

very interesting reading. I'll have to go through that a few more times though ;-)

If anybody is interested, I ended up coding the whole Kakuro solver in Python 3. It will take 0.1 seconds (measured on my not-so-recent laptop) for a regular 9x9 puzzle, or 0.5 seconds for the one I included in the source code, classified "diabolical". Most of the time is actually spend in deepcopy() so you can probably get a big speedup if you create a specific deepcopy() function.

Given the fact that I need minutes to enter the puzzle in text format, this was good enough and I stopped optimisations right there. 

I ran it with pypy3 which also features a JIT compiler. The execution times were worse than with stock CPython for obvious reasons.

The code is available at https://github.com/uselpa/kakuro-solver.

Thanks to all of you for your feedback! An awesome community is what makes an excellent language a great language.

Regards,
Patrick

Kristoffer Carlsson

unread,
Nov 3, 2015, 3:17:55 AM11/3/15
to julia-users
That was a very interesting read, thanks for sharing.

Patrick Kofod Mogensen

unread,
Nov 3, 2015, 3:49:12 AM11/3/15
to julia-users
Great read.


On Tuesday, November 3, 2015 at 6:43:24 AM UTC+1, Jason Merrill wrote:
Reply all
Reply to author
Forward
0 new messages