# Re: [julia-dev] Performance question: Julia vs Python on a Graph Problem

### Stefan Karpinski

Aug 14, 2012, 7:12:14 PM8/14/12
I can't take a look at this just now, but I can check it out later when I have some time. There are a number of performance tips that might be of help:

http://docs.julialang.org/en/latest/manual/performance-tips/

Take a look at those and maybe one of them will stand out as relevant. In the meantime, maybe some of the other Juliaphiles here can check out your code :-)

On Tue, Aug 14, 2012 at 7:07 PM, SirVer wrote:

Hi List,

I am an engineer/scientist and was intrigued by the microbenchmarks on the main page of the Julia site. I work mostly with Python these days, so I was interested in how Julia would stack up against a Python implementation of mine. I took a problem from a publicly available problem set [1] which runs for a few hours in my Python implementation [2] and coded the same algorithm in Julia [3]. The problem is finding the node in a graph that has the lowest mean path length to every other node in the graph. I reduced the problem to only handle the first 20 nodes in the graph, and I find that Python takes 4 seconds while Julia takes 15. I wonder what I screwed up so badly in my Julia implementation, and whether someone is willing to point me in the right direction. You will also need the graph file for testing:
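For readers following along, the algorithm under discussion (a breadth-first search from each node to get its mean shortest-path length to every other node) can be sketched in Python roughly like this; the names are illustrative, not the actual gist code:

```python
class Node:
    """A graph node; `neighbors` holds adjacent Node objects."""
    def __init__(self, name):
        self.name = name
        self.neighbors = []

def centrality_mean(start):
    """Mean shortest-path length from `start` to every reachable node (BFS)."""
    dists = {start: 0}
    frontier = [start]
    while frontier:
        nxt = []
        for node in frontier:
            for nb in node.neighbors:
                if nb not in dists:       # first visit gives the shortest distance
                    dists[nb] = dists[node] + 1
                    nxt.append(nb)
        frontier = nxt                    # advance one BFS level
    return sum(dists.values()) / (len(dists) - 1)
```

The node with the lowest `centrality_mean` is the answer; the full problem runs this BFS once per node.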

Kind regards,
Holger H. Rapp

--

### Stefan Karpinski

Aug 14, 2012, 7:14:40 PM8/14/12
Ok, I did have a moment to skim through this and it all looks pretty reasonable. I'll take a deeper look later. Either there's some simple fix, or we have some optimizations to do.

### Stefan Karpinski

Aug 14, 2012, 7:16:16 PM8/14/12
In fact, you seem to have written some rather nice idiomatic Julia here.

### SirVer

Aug 14, 2012, 7:31:13 PM8/14/12

On Wednesday, August 15, 2012 1:16:16 AM UTC+2, Stefan Karpinski wrote:
In fact, you seem to have written some rather nice idiomatic Julia here.

That is really nice of you to say :) I was in fact unsure how many type annotations I should give, and I opted for more to make things easier on the compiler. I also ran into trouble when trying a d = {}; d[<node value>] = <float value>... This somehow didn't work. So I just typed a lot.
Another choice was whether Node should be a new type or not. I opted for it.

I was a little disappointed by the limited profiling available. I couldn't even tell *where* Julia was slower than Python - something gives me the feeling it is in the Set implementation.

### Kevin Squire

Aug 14, 2012, 7:59:33 PM8/14/12
I was a little disappointed in the limited Profiling available. I couldn't even tell *where* Julia was slower than python - something gives me the feeling it is in the Set implementation.

While the profiling isn't as nice as Python's, you can wrap code you want to profile in "@profile begin ... end", and then (at the Julia prompt) get results with "@profile report".

Unfortunately, when I tried this on your file, I get an error:

in main: a not defined
in main at no file:74
at /home/kmsquire/tmp/graph/actor_centrality.jl:86

Not sure what the problem is, and unfortunately, I don't have time to look closer right now either.

Kevin

### Jameson Nash

Aug 14, 2012, 8:34:15 PM8/14/12
The implementation of pop is hurting performance a good bit (it's a significant performance hit in Python too):

replacing lines 25-26 with "for n = next" in Julia (~7.4s down to ~2.5s)
and replacing lines 15-16 with "for n in next" in Python (~1s down to ~0.8s)
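The difference between the two loop shapes can be seen in a small Python sketch (illustrative only, not the actual gist lines):

```python
def drain_with_pop(frontier):
    # the original pattern: destructively pop each element off the set
    total = 0
    while frontier:
        n = frontier.pop()
        total += n
    return total

def drain_with_iter(frontier):
    # the suggested pattern: iterate the set without mutating it
    return sum(n for n in frontier)

assert drain_with_pop({1, 2, 3}) == drain_with_iter({1, 2, 3}) == 6
```

Both visit every element, but popping pays for a hash-table mutation on each iteration, while plain iteration just walks the storage.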

--

### Tim Holy

Aug 14, 2012, 10:04:18 PM8/14/12
Dear Holger,

On Tuesday, August 14, 2012 04:31:13 PM SirVer wrote:
> I was a little disappointed in the limited Profiling available.

It's definitely very limited right now. Sorry about that. To add a couple of
points to Kevin's, you have to say 'load("profile.jl")' before you can even do
any profiling at all.

Also, the profiling doesn't time certain things: in this case, that's a big
problem, because it turns out it's the computation "length(next)" in your
while loop that is the single biggest performance robber (more than twice the
time of all other lines put together). I discovered that simply by (1) knowing
that it was the line after the while statement that dominated the timing, and
(2) duplicating the length computation on the next line and profiling the
duplicate.

Presumably if you can test whether it's empty more efficiently, that will help.

At the moment I can't remember why I didn't put in timing of the conditions in
control/flow statements, but I have a vague memory that it's a limitation due
to elseif statements not getting their own line numbers. Perhaps if that could
be fixed, one could make further improvements.

Best,
--Tim


### Jameson Nash

Aug 14, 2012, 11:00:28 PM8/14/12
Updating Julia dropped my timings from ~2.5s down to ~1.4s (or maybe it was just the fact that I warmed up the JIT cache on the later trials, but I think it was the fresh pull of changes from over the weekend).

My next step was to convert (almost) all sets into lists, which brought the total timing down to about ~0.8s. That is the same as Python for 20 actors, but I observed it to be slightly faster than the Python implementation for 100 actors.

Profiling revealed that this line now has the most significant impact on performance (about 50-70% of the runtime, after my optimizations):
dists[n] = cdist # 793400  1.887960   1.856224 line(30)
(I didn't use the profile pre-optimizations)

About 40% of this may be attributable to the hashing function. Adding an explicit call to hash(n) before this line neatly splits that timing roughly in those proportions (don't ask me how, since there is no caching in effect that I'm aware of).

So, it seems we are leaving quite a bit of performance on the table with the current implementation of Sets / Dicts / hashing. Anyone want to look into that?
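One plausible way to recover part of that hashing cost, sketched here in Python as a hypothetical variation on the node type (not the thread's actual code), is to compute each node's hash once at construction and reuse it for every dict/set lookup:

```python
class Node:
    __slots__ = ("name", "neighbors", "_hash")

    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self._hash = hash(name)   # compute the hash once, at construction

    def __hash__(self):
        return self._hash         # every dict/set lookup reuses the cached value

    def __eq__(self, other):
        return self is other      # nodes are unique objects, so identity suffices

# nodes can now be used as dict keys without rehashing the name each time
dists = {}
n = Node("Kevin Bacon")
dists[n] = 1.0
```

Whether the same caching is worthwhile in Julia depends on where its hash time actually goes, which is exactly what the profile above is hinting at.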

Here's the Gist of what I changed: https://gist.github.com/3355121
A few more performance optimizations that I attempted are also included (such as replacing length with isempty), but I didn't mention them because I didn't see a difference.

-Jameson

(P.S. it seems there may be a bug in the profiler, since it couldn't profile the main function on my machine either due to "a is undefined")

--

### Patrick O'Leary

Aug 14, 2012, 11:44:55 PM8/14/12
On Tuesday, August 14, 2012 9:04:18 PM UTC-5, Tim wrote:
...because it turns out it's the computation "length(next)" in your
while loop that is the single biggest performance robber (more than twice the
time of all other lines put together).

This is worth expanding on a bit. Generic length() is a bit evil, actually: for nearly anything that isn't an array, it will iterate the entire structure just to count it, so length() is O(n). Set wraps Dict internally, which has the following length() method:

function length(t::Dict)
    n = 0
    for v in t.vals
        n += int(!is(v,_jl_secret_table_token))
    end
    return n
end

When implementing the linked-list-based version of LRU, I ended up tracking the list length with a counter, which greatly sped up that operation (it became O(1)) at the cost of having to update the counter, plus the memory cost of the counter itself. But for large collections it seems like a reasonable trade. Is this something we'd want to consider for Dict?
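The counter idea can be illustrated with a toy Python linked list (a hypothetical sketch; CPython's own lists and dicts already store their length):

```python
class CountedList:
    """Singly linked list that keeps an element counter so len() is O(1)."""

    class _Cell:
        __slots__ = ("value", "next")
        def __init__(self, value, next=None):
            self.value = value
            self.next = next

    def __init__(self):
        self._head = None
        self._count = 0          # maintained on every push/pop

    def push(self, value):
        self._head = self._Cell(value, self._head)
        self._count += 1

    def pop(self):
        cell = self._head
        self._head = cell.next
        self._count -= 1
        return cell.value

    def __len__(self):
        return self._count       # O(1): no walk over the cells
```

The trade-off is exactly the one described above: two extra instructions on every mutation in exchange for constant-time length queries.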

### Jeff Bezanson

Aug 15, 2012, 1:38:48 AM8/15/12
Great to have more benchmarks like this where we don't do well.
Currently this can happen for programs that make good use of python's
well-honed libraries.

O(n) length() is certainly suboptimal! We probably should add a
counter to Dict. In something like IntSet the overhead of maintaining
the counter might be more significant, so I'm not sure.

### Tim Holy

Aug 15, 2012, 5:16:57 AM8/15/12
On Tuesday, August 14, 2012 11:00:28 PM Jameson Nash wrote:
> (P.S. it seems there may be a bug in the profiler, since it couldn't profile
> the main function on my machine either due to "a is undefined")

Kevin & Jameson, you're right. Something about the profiler doesn't work with
string interpolation. I think it's because string interpolation is a macro,
and you therefore have a macro inside a macro.

Sadly, at the moment I don't have time to fix it, so I'm filing an issue
(#1159).

--Tim

### SirVer

Aug 15, 2012, 2:50:35 PM8/15/12
Wow,

thanks for all the replies! Quite impressive that people took my small example to heart. I took much of the advice in this thread: especially iterating over all elements in the Set instead of popping from it (suggested by Jameson; the popping version took 300% of the runtime of Python), and using isempty instead of length, which didn't change all that much - likely because the number of neighbours is always quite small. I also tried replacing Sets with lists, but this impacted performance negatively in my case (Jameson had more success, but it seems he only initializes his lists once rather than each time the function is called).

I went the extra mile and implemented this example also in c++ [1]. These are the best timings I could get for 100 actors:

Julia: 11.8415s (without Graph creation and parsing)
Python: 4.61546s (without Graph creation and parsing)
C++: 2.76s (with Parsing and Graph creation)

Final gists are here

Altogether, I am not quite happy with the performance I saw - I was unable to get close to the Python performance (not to speak of C++ at all) without quite counter-intuitive optimization techniques which didn't improve code readability. I am also quite impressed by Python's performance here, to be blunt. And I am very impressed with the responsiveness of this mailing list and the developers! Thanks again for your interest in this problem.

If this example is useful for the Julia developers (for example in a benchmark suite), feel free to use it. The original problem and the imdb-1.tsv file are from udacity.com [1]; I firmly believe they are okay with Julia using this example, but one should maybe ask.

Holger

### Stefan Karpinski

Aug 15, 2012, 3:14:53 PM8/15/12
I think we should absolutely include this in a next-gen test suite — this kind of graph code using user-defined types is something we definitely want to have be super fast. Python's performance on this is quite impressive: it makes excellent use of its very well-tuned built-in data structures.

Also, I would argue that going forward, we really only need to compare Julia against C/C++ and aim to be no more than 2x slower than C (and obviously, as close to parity as we can get). The point of the original micro-benchmarks was partly to show that high-level languages were not performing well on basic constructs like iteration and recursion, and to push ourselves to do better. Now that we've established the point, there's no need to belabor it. Plus, writing the same code in seven languages is a massive pain.

--

### Jameson Nash

Aug 15, 2012, 3:29:16 PM8/15/12
Preallocating the arrays is helpful, but not essential.

Repeating my work based on your latest Julia code Gist (each step is adding to the previous, timings are based on 1 trial):
initially => elapsed time: 15.5 seconds
repeat execution (after JIT warmed up) => elapsed time: 7.4 seconds
using list instead of Set for next and nnext => elapsed time: 5.0 seconds
copying Node.n to a Vector (after initialization) => elapsed time: 3.5 seconds
reusing next and nnext allocation buckets => elapsed time: 2.5 seconds

It's quite impressive that Python has managed to be within 2x the speed of C. I believe this is because almost all of the time in this code is spent searching the hash sets. The fact that Python has a fast (but poor) hashing algorithm, combined with the fact that all of its set code is a well-optimized C implementation, plus its reference-counting GC, makes it very competitive.

-Jameson

--

### Stefan Karpinski

Aug 15, 2012, 3:42:52 PM8/15/12
Nice analysis. I created an issue for turning this into a benchmark in perf2.

--

### Stefan Karpinski

Aug 15, 2012, 3:44:58 PM8/15/12
On Wed, Aug 15, 2012 at 3:29 PM, Jameson Nash wrote:
using list instead of Set for next and nnext => elapsed time: 5.0 seconds

I assume you mean array here, right? It is a bizarrely insistent error that the Python community persists in calling their arrays "lists". A list is a very different data structure. Can you post your fully optimized code?

### Jameson Nash

Aug 15, 2012, 4:40:19 PM8/15/12
yes, I meant a vector/array (I learned Python first, so the list moniker got stuck in my head, although a linked list should have roughly the same performance here)

I updated the gist with my most recent changes: https://gist.github.com/3355121

(It's essentially the same as the update last night, but doesn't include the replacement of the comprehensions with a call to elements)

--

### Tim Holy

Aug 15, 2012, 4:48:11 PM8/15/12
On Wednesday, August 15, 2012 11:50:35 AM SirVer wrote:
> Julia: 11.8415s (without Graph creation and parsing)
> Python: 4.61546s (without Graph creation and parsing)
> C++: 2.76s (with Parsing and Graph creation)

That's...strange. On my laptop (which is nothing special) with the version I
tweaked (a bit) for performance, I get 3.54 seconds for Julia. Maybe I
disabled something in my messing around? I don't have time to check in detail,
but feel free to compare my version to yours.

Most likely, this is just noise because I broke something, but on the off
chance it's real I wanted to make sure you had it...

--Tim
actor_centrality.jl

### Jameson Nash

Aug 15, 2012, 5:36:18 PM8/15/12
Tim: If I'm not mistaken, you're using the optimized code that I posted last night. I'm not sure if that was your intention, but the timings you post would also seem to be in line with that assumption. (I'm running timings on a fast i7 laptop.)

SirVer: I'm using @time main() for measuring timing, which I think does include Graph creation and parsing (and a lot of time to warm up the JIT at first), so I'm not entirely sure how these line up against the numbers you posted.

### Tim Holy

Aug 15, 2012, 5:39:46 PM8/15/12
On Wednesday, August 15, 2012 05:36:18 PM Jameson Nash wrote:
> Tim: If I'm not mistaken, you're using the optimized code that I posted last
> night.

> SirVer: I'm using @time main() for measuring timing, which I think does
> include Graph creation and parsing (and a lot of time to warm up the JIT at
> first)

Likewise. So either our machines are much faster, or @SirVer you're not
testing Jameson's great code!

--Tim

### Jeff Bezanson

Aug 15, 2012, 6:23:12 PM8/15/12
If I'm allowed to cheat, I notice that the only purpose of the "dists"
dictionary is to associate a number with each node. But the Node data
structure already stores per-node data, so by adding a field to Node
we can skip the dists dictionary entirely. Then the thing flies and
takes less than 2 seconds. Of course you have to reset the dist fields
between calls to centrality_mean; this is included in the timing.
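The same trick translates directly to Python; here is a hypothetical sketch that stores the distance on the node itself, with an added `visit_id` stamp (my own embellishment, not from the thread) so that no reset pass is needed between calls, provided each call uses a fresh nonzero id:

```python
class Node:
    __slots__ = ("name", "neighbors", "dist", "visit_id")

    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.dist = 0
        self.visit_id = 0   # which BFS run last touched this node

def centrality_mean(start, visit_id):
    # BFS that writes distances straight into the node: no dists dictionary.
    # Comparing visit_id against the current run replaces the reset pass.
    start.visit_id = visit_id
    start.dist = 0
    frontier = [start]
    total = count = 0
    while frontier:
        nxt = []
        for node in frontier:
            for nb in node.neighbors:
                if nb.visit_id != visit_id:   # not yet seen in this run
                    nb.visit_id = visit_id
                    nb.dist = node.dist + 1
                    total += nb.dist
                    count += 1
                    nxt.append(nb)
        frontier = nxt
    return total / count if count else 0.0
```

The hash lookups disappear entirely: every "dictionary access" becomes a plain field read on an object already in hand.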

### SirVer

Aug 15, 2012, 6:32:03 PM8/15/12

> SirVer: I'm using @time main() for measuring timing, which I think does
> include Graph creation and parsing (and a lot of time to warm up the JIT at
> first)
You are totally right. So all my timings are with parsing and construction (the python code as well).

Julia: 6.5 secs
Python: 4.67 secs
C++: 2.82 secs

Note that the C++ version is maybe not comparable: it uses std::map a lot, which is (usually) a tree and always ordered. One could also try std::unordered_map/set to get a more directly comparable version, but I haven't bothered so far.

Likewise. So either our machines are much faster, or @SirVer you're not
testing Jameson's great code!
My machine is a 2.26 Ghz Intel Core 2 Duo with 4 Gigs of RAM Running Mac OS. Likely your machines are faster ;)

Something different: I also tested an iterative Fibonacci example. Julia beats Python here (barely: 10 musecs vs 6.2 musecs), but gets nowhere close to the performance of C++ compiled with -O3 (much less than 1 musec) or -O0 (0.4 musecs). Code for all languages can be found in this gist:
https://gist.github.com/3364304. I am a bit surprised by this. I thought Julia should shine here after the JIT has warmed up - the recursive version is comparable to C++.


Cheers,
Holger

### Jameson Nash

Aug 15, 2012, 6:45:36 PM8/15/12
Ooh, I missed that one. If I'm allowed to cheat (replacing Sets with Arrays), then you should be too :) Getting rid of the Dict was the last performance step I was hoping for too.

On my machine this reduces the timing to < 1 second!

### SirVer

Aug 15, 2012, 6:59:35 PM8/15/12

On Thursday, August 16, 2012 12:45:36 AM UTC+2, Jameson wrote:
Ooh, I missed that one. If I'm allowed to cheat (replacing Sets with Arrays), then you should be too :) Getting rid of the Dict was the last performance step I was hoping for too.

On my machine this reduces the timing to < 1 second!
Oh wow! That change brings a lot of speed. Julia outperforms my sloppy C++ implementation with this. For 1000 actors:
C++: 26.95s, Julia: 19.3s. But obviously, the Julia version is no longer really comparable from an algorithmic standpoint.

### Jameson Nash

Aug 15, 2012, 6:59:45 PM8/15/12
Timings in the microseconds are not particularly accurate. Julia might be taking most of that time to do the initial method lookup (and other compiler and system overhead).

If I write the test so that you are testing the performance of just the Julia fib code, the function call itself actually takes < 0.2us.

function time_fib_in_microseconds()
    # 1e6 calls: the elapsed seconds equal the per-call cost in microseconds
    time_us = @elapsed begin
        for i = 1:1e6
            fib(uint64(92))
        end
    end
end
time_fib_in_microseconds()
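The Python side of such a microbenchmark is usually handled the same way, amortizing many calls with timeit rather than timing one call. A sketch (the iterative fib here is my own, not the gist's, though fib(92) and its value match the thread's C++ output):

```python
import timeit

def fib(n):
    # iterative Fibonacci: fib(0) = 0, fib(1) = 1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# amortize many calls so per-call noise and interpreter startup average out
calls = 100_000
total_s = timeit.timeit(lambda: fib(92), number=calls)
per_call_us = total_s / calls * 1e6
```

Timing one call in isolation mostly measures clock resolution and dispatch overhead, which is the point being made above.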

If the C++ version is using a tree map (which is much better for iterating than a hash map), it would be unsurprising that it's so much faster -- in fact, it may be doing closer to the amount of work that my optimized Julia version does.

I can't explain why your computer seems to be so much slower at running my code (assuming that is the timing for at least the second call to main).
The python version takes my computer about 3 seconds. Julia takes about 4 seconds first call and 3 seconds subsequently.

--

### John Cowan

Aug 15, 2012, 7:00:00 PM8/15/12
On Wed, Aug 15, 2012 at 3:44 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
> On Wed, Aug 15, 2012 at 3:29 PM, Jameson Nash <jam...@mit.edu> wrote:

> I assume you mean array here, right? It is a bizarrely insistent error that
> the Python community persists in calling their arrays "lists".

Python lists are linked lists of short blocks rather than of
individual cells. But you can insert into and delete from the middle
with O(1) amortized cost (because the block size is a constant), and
that makes them true lists.

NumPy arrays are "real" arrays.

--
GMail doesn't have rotating .sigs, but you can see mine at
http://www.ccil.org/~cowan/signatures

### Alessandro "Jake" Andrioni

Aug 15, 2012, 7:14:00 PM8/15/12
On 15 August 2012 19:32, SirVer <sir...@gmx.de> wrote:
> Something different: I also tested an iterative fibonacci example. Julia
> beats Python here (barely: 10 mu secs vs 6.2 musecs), but gets nowhere close
Macbook Air 2011, Core i5 1.7 GHz with 4GB of RAM running Lion.
Julia: Elapsed: 2.86102294921875 musecs
Python: Elapsed: 19.00 musecs

This is on julia version 0.0.0+88059312.r021c Commit 021cf71d91
(2012-06-07 05:34:18) and Python 2.7.1.

For comparison, this is the C++ version compiled with -O3 and llvm-g++ 4.2:
usuario:scratch/ $ time ./fib
7540113804746346429
./fib 0,10s user 0,00s system 98% cpu 0,107 total

Julia seems quite a bit faster than Python, and a newer version would probably be even faster.

### Mike Nolta

Aug 15, 2012, 7:21:21 PM8/15/12
On Wed, Aug 15, 2012 at 7:00 PM, John Cowan <johnw...@gmail.com> wrote:
> On Wed, Aug 15, 2012 at 3:44 PM, Stefan Karpinski <ste...@karpinski.org> wrote:
>> On Wed, Aug 15, 2012 at 3:29 PM, Jameson Nash <jam...@mit.edu> wrote:
>
>> I assume you mean array here, right? It is a bizarrely insistent error that
>> the Python community persists in calling their arrays "lists".
>
> Python lists are linked lists of short blocks rather than of
> individual cells. But you can insert into and delete from the middle
> with O(1) amortized cost (because the block size is a constant)., and
> that makes them true lists.
>

Not according to the documentation:

http://docs.python.org/faq/design.html#how-are-lists-implemented

-Mike
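A quick experiment backs up the documentation: if CPython lists were linked lists, inserting at the front would be as cheap as appending at the back. In practice, the front insert shifts every element, so it is O(n) per call while append is amortized O(1) (timings are machine-dependent; the sketch only checks the ordering):

```python
import timeit

# 20,000 single-element operations, each starting from an empty list
front = timeit.timeit("lst.insert(0, 0)", setup="lst = []", number=20_000)
back = timeit.timeit("lst.append(0)", setup="lst = []", number=20_000)

# dynamic-array behavior: repeated front inserts cost quadratic total time,
# so they come out clearly slower than the appends
assert front > back
```

This is exactly the dynamic-array (not linked-list) profile the CPython FAQ describes.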


### Viral Shah

Aug 16, 2012, 12:12:21 AM8/16/12
On Thursday, August 16, 2012 12:59:16 AM UTC+5:30, Jameson wrote:

It's quite impressive that Python has managed to be within 2x the speed of C. I believe this is because almost all of the time in this code is spent searching the hash sets. The fact that Python has a fast (but poor) hashing algorithm, combined with the fact that all of its set code is a well-optimized C implementation, plus its reference-counting GC, makes it very competitive.

This is a recurring theme in much of the code people have posted to the mailing list. Vectorized code is at times slower in Julia today because the entire Julia library is written in the language, whereas the other languages have C implementations of vectorized operations that have been tuned aggressively. Code that uses other data structures also tends to be slower when the other languages have tuned C implementations. We haven't spent much time on optimizing GC, but there is clearly room for improvement there as well.

I guess these are some of the things that the perf 2.0 benchmarks will focus on.

-viral