As a general rule, calling Python functions for small computations or lookups in your inner loop is going to be slow, for the same reason that it is slow in Python itself. The only way to get good performance from Python is to perform large computations (or small computations on lots of data) in a single call that dispatches to C (or similar) code.
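The same effect is easy to see in pure Python (a sketch of my own, not from the original code): summing elements with an interpreted per-element loop versus pushing the whole loop into a single builtin call. The per-element version is the analogue of making a Python call for every `result[i][j]` from Julia.

```python
def per_element(xs):
    # One interpreted Python operation per element: the loop and
    # call overhead dominate when the work per element is tiny.
    total = 0
    for x in xs:
        total += abs(x)
    return total

def one_big_call(xs):
    # The whole loop runs inside C (the builtins sum and map), so
    # interpreter overhead is paid once per call, not per element.
    return sum(map(abs, xs))
```

Both return the same value; the difference is only in how many times the interpreter overhead is paid, which is exactly the issue with fine-grained PyCall operations.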
Unfortunately, when you iterate over the query_radius result, you are essentially calling the equivalent of result[i][j] for every index i and j. Since each result[i][j] is a Python indexing call, you have Python calls in your inner loop (along with memory allocations for the Julia wrappers around the returned Python objects).
(Normally, large arrays can be passed back and forth via NumPy arrays, which have low overhead because all of the Python work is done in a single call and then we just get a pointer to the array data. In this case, however, query_radius returns an array of arrays, presumably because the result[i] arrays can have different lengths, so we necessarily have to make some Python calls for each of the 10000 elements of the result array.)
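To see why a ragged result forces per-element work, here is a pure-Python sketch (the data is made up, not the sklearn API): a result like `[[0, 2], [1], [0, 1, 2]]` has no single rectangular buffer, so any consumer must touch each row separately. One standard workaround is to flatten into a single data array plus row offsets (CSR-style), which *can* be passed in one low-overhead call:

```python
# Hypothetical ragged result, shaped like what query_radius returns:
ragged = [[0, 2], [1], [0, 1, 2]]

# No single (n, m) rectangular buffer describes this data, so each
# row must be handled separately.  Flattening into one data array
# plus row offsets restores the "one big contiguous array" shape:
data = [i for row in ragged for i in row]
offsets = [0]
for row in ragged:
    offsets.append(offsets[-1] + len(row))

# Row j is recovered as data[offsets[j]:offsets[j+1]]:
assert data[offsets[1]:offsets[2]] == [1]
```

query_radius does not return this flattened form, which is why some per-row Python calls are unavoidable here.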
That being said, there is something funny going on here. I wrote two different versions of your routine:
function query_radius1(balltree::PyObject, X, radius)
    pyind = pycall(balltree["query_radius"], PyVector{PyObject}, X, radius)
    return Vector{Int}[convert(Vector{Int}, o) for o in pyind]
end

function query_radius2(balltree::PyObject, X, radius)
    pyind = pycall(balltree["query_radius"], PyVector{PyObject}, X, radius)
    return Vector{Int}[copy(PyVector{Int}(o)) for o in pyind]
end
that should be very similar in performance, but the second version is 5x faster on my machine. I'll have to look into why.