Google Groups

First steps with Julia, performance tips


Wes McKinney May 1, 2012 12:36 PM
Posted in group: julia-dev
hey guys,

I plan to have a tinker with Julia now and then, being mainly a
Python/Cython hacker. Compiled Julia from git master today in advance
of Stefan's talk tonight.

I was curious if Julia does any optimization of array expressions, so
I set up a very simple benchmark: (Note, I am _not_ trolling even if
it seems like it and just looking to understand Julia's JIT and what's
going on)

function test1()
    n = 1000000
    a = randn(n)
    b = randn(n)
    c = randn(n)
    for i=1:500
        result = sum(a + b + c)
    end
end

@time test1()

On my machine this takes about 14-15ms per iteration. OK. Time to fire
up IPython and see how NumPy (not actually super optimized, I've had
to reimplement loads of things in NumPy by hand in Cython) does:

In [6]: timeit (a + b + c).sum()
100 loops, best of 3: 11 ms per loop

OK, not bad. Time for NumExpr:

In [13]: import numexpr as ne

In [12]: timeit ne.evaluate('sum(a + b + c)')
100 loops, best of 3: 6.9 ms per loop

So, how much performance is actually left on the table? C function:

double add_things(double *a, double *b, double *c, int n) {
  register double result = 0;
  register int i;

  for (i = 0; i < n; ++i)
  {
    result += *a++ + *b++ + *c++;
  }
  return result;
}

Wrapped in Cython and compiled:

cdef extern from "foo.h":
    double add_things(double *a, double *b, double *c, int n)


def cython_test(ndarray a, ndarray b, ndarray c):
    return add_things(<double*> a.data,
                      <double*> b.data,
                      <double*> c.data, len(a))

In [6]: timeit cython_test(a, b, c)
100 loops, best of 3: 2.26 ms per loop

Turns out doing C isn't even really necessary, straight Cython will do:

def cython_test2(ndarray[float64_t] a, ndarray[float64_t] b,
                 ndarray[float64_t] c):
    cdef:
        Py_ssize_t i, n = len(a)
        float64_t result = 0

    for i in range(n):
        result += a[i] + b[i] + c[i]

    return result

In [5]: timeit cython_test2(a, b, c)
100 loops, best of 3: 2.25 ms per loop

So here's the question: am I doing it wrong? Even in NumExpr above you
can get much better performance in array operations with no true JIT
(it has a VM that tries to eliminate temporaries). But NumExpr is
extremely limited. At minimum I was very surprised that vanilla
Python, temporaries and all, wins out over Julia in this simple
benchmark. Note that the %timeit function disables Python's GC which
may be having an effect in the timings.

cheers and looking forward to tonight's talk,

Wes