As a general rule, calling Python functions for small computations or lookups in your inner loop is going to be slow, for the same reason that it is slow in Python itself. The only way to get good performance from Python is to perform large computations (or small computations on lots of data) in a single call that dispatches to C (or similar) code.
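The same effect is easy to see in pure Python (a sketch of my own, not from the original code): summing elements with an interpreted per-element loop versus pushing the whole loop into a single builtin call. The per-element version is the analogue of making a Python call for every `result[i][j]` from Julia.

```python
def per_element(xs):
    # One interpreted Python operation per element: the loop and
    # call overhead dominate when the work per element is tiny.
    total = 0
    for x in xs:
        total += abs(x)
    return total

def one_big_call(xs):
    # The whole loop runs inside C (the builtins sum and map), so
    # interpreter overhead is paid once per call, not per element.
    return sum(map(abs, xs))
```

Both return the same value; the difference is only in how many times the interpreter overhead is paid, which is exactly the issue with fine-grained PyCall operations.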
Unfortunately, when you iterate over the query_radius result, you are essentially calling the equivalent of result[i][j] for every index i and j. Since each result[i][j] is a Python indexing call, you have Python calls in your inner loop (along with memory allocations for the Julia wrappers around the returned Python objects).
(Normally, large arrays can be passed back and forth via NumPy arrays, which have low overhead because all of the Python work is done in a single call and then we just get a pointer to the array data. In this case, however, query_radius returns an array of arrays, presumably because the result[i] arrays can have different lengths, so we necessarily have to make some Python calls for each of the 10000 elements of the result array.)
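To see why a ragged result forces per-element work, here is a pure-Python sketch (the data is made up, not the sklearn API): a result like `[[0, 2], [1], [0, 1, 2]]` has no single rectangular buffer, so any consumer must touch each row separately. One standard workaround is to flatten into a single data array plus row offsets (CSR-style), which *can* be passed in one low-overhead call:

```python
# Hypothetical ragged result, shaped like what query_radius returns:
ragged = [[0, 2], [1], [0, 1, 2]]

# No single (n, m) rectangular buffer describes this data, so each
# row must be handled separately.  Flattening into one data array
# plus row offsets restores the "one big contiguous array" shape:
data = [i for row in ragged for i in row]
offsets = [0]
for row in ragged:
    offsets.append(offsets[-1] + len(row))

# Row j is recovered as data[offsets[j]:offsets[j+1]]:
assert data[offsets[1]:offsets[2]] == [1]
```

query_radius does not return this flattened form, which is why some per-row Python calls are unavoidable here.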
That being said, there is something funny going on here. I wrote two different versions of your routine:
function query_radius1(balltree::PyObject, X, radius)
    pyind = pycall(balltree["query_radius"], PyVector{PyObject}, X, radius)
    return Vector{Int}[convert(Vector{Int}, o) for o in pyind]
end

function query_radius2(balltree::PyObject, X, radius)
    pyind = pycall(balltree["query_radius"], PyVector{PyObject}, X, radius)
    return Vector{Int}[copy(PyVector{Int}(o)) for o in pyind]
end
that should be very similar in performance, but the second version is 5x faster on my machine. I'll have to look into why.