performance of Python's `sum` vs `np.sum` vs writing a naive sum


Ecolss Logan

Apr 10, 2016, 4:04:16 AM
to cython-users
Python has a built-in function `sum`, and I wonder what the performance will be if I use Python's `sum` in Cython. Below is the code, in comparison to `np.sum` and a simple hand-coded sum.

# Python's sum
cpdef long py_sum(long[:] A):
  cdef long s
  s = sum(A)
  return s

# simple hand-coded sum
cpdef long cy_sum(long[:] A):
  cdef long i, n = A.shape[0], s = 0
  for i in range(n):
    s += A[i]
  return s
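(For reference, the original post doesn't show how the benchmark array `a` was created; the setup below is an assumption chosen to match the `long[:]` memoryview signature, with a pure-Python loop standing in for `cy_sum`:)

```python
import numpy as np

# Hypothetical setup: the array `a` used in the timings below is not shown
# in the original post, so this is an assumption (int64 matches C long on
# most 64-bit Unix platforms).
a = np.arange(100000, dtype=np.int64)

# Pure-Python equivalent of cy_sum, for reference (no Cython needed):
def py_loop_sum(A):
    s = 0
    for i in range(A.shape[0]):
        s += A[i]
    return s

print(py_loop_sum(a) == a.sum())  # True
```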


Below is the comparison of 3 sums:
In [44]: %timeit py_sum(a)
100 loops, best of 3: 2.7 ms per loop

In [45]: %timeit cy_sum(a)
100000 loops, best of 3: 5.08 µs per loop

In [46]: %timeit np.sum(a)
The slowest run took 27.32 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7 µs per loop

Apparently, `py_sum` is way too slow, while `cy_sum` and `np.sum` are about the same.
I wonder how Python's `sum` is handled in Cython, and why it's so slow?
Does Cython simply call the implementation of Python's `sum`?

Thanks


Yury V. Zaytsev

Apr 10, 2016, 5:04:33 AM
to cython-users
On Sun, 10 Apr 2016, Ecolss Logan wrote:

> I wonder how Python's `sum` is handled in Cython, and why it's so slow?
> Does Cython simply call the implementation of Python's `sum`?

Yes.

--
Sincerely yours,
Yury V. Zaytsev

Ecolss Logan

Apr 10, 2016, 7:59:33 AM
to cython-users


On Sunday, April 10, 2016 at 5:04:33 PM UTC+8, Yury V. Zaytsev wrote:
Then why is it so slow?
I thought Python's `sum` is implemented in C; what makes it slow?

 

Ecolss Logan

Apr 10, 2016, 8:12:25 AM
to cython-users
I'm very concerned about this: if Python's built-in functions are so much slower than implementing them in pure Cython, then why should we use them, and why should we use Python's `list`/`dict`/`set` etc.?
That would make no sense at all.


On Sunday, April 10, 2016 at 7:59:33 PM UTC+8, Ecolss Logan wrote:

Yury V. Zaytsev

Apr 10, 2016, 9:11:46 AM
to cython-users
On Sun, 10 Apr 2016, Ecolss Logan wrote:

>
> Then why is it so slow? I thought Python's `sum` is implemented in C;
> what makes it slow?

The reason for it being "slow" is that it is designed to operate on
iterable sequences of boxed objects (which might not necessarily represent
numbers in the first place), that is, in your case, every number from the
array is taken out, boxed into a Python object and only then processed.

This is a very inefficient approach if you *know* that you are working
with numbers, and these numbers are arranged in a contiguous memory block,
such as an array.

However, at the same time, it's a very general approach, and it will work
for any sequence of objects that implements addition, whereas the more
efficient approach of looping over an array would not.
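(An aside to illustrate that generality, my example rather than Yury's: the builtin `sum()` happily works on any objects that implement `+`, such as exact fractions, precisely because it operates on boxed objects rather than on one fixed C type:)

```python
from fractions import Fraction

# sum() accepts any sequence of objects implementing __add__, at the cost
# of going through boxed Python objects for every element.
exact = sum(Fraction(1, n) for n in range(1, 5))
print(exact)  # 25/12

# A typed C loop over a contiguous buffer, by contrast, can only handle
# one fixed numeric type: it trades this generality for speed.
```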

Yury V. Zaytsev

Apr 10, 2016, 9:26:24 AM
to cython-users
On Sun, 10 Apr 2016, Ecolss Logan wrote:

> I'm very concerned about this: if Python's built-in functions are so
> much slower than implementing them in pure Cython, then why should we
> use them, and why should we use Python's `list`/`dict`/`set` etc.? That
> would make no sense at all.

You need to finally realize that achieving high performance is not about
always using the lowest-level tool for the job to squeeze every last
microsecond out of the code.

A good engineer will use an *appropriate* tool for the job & stay as high
up as possible, as long as the performance doesn't matter, and focus on
dealing with real bottlenecks as identified via meaningful measurements by
implementing algorithmic improvements, and/or going down to lower levels
as required.

I hope you can see now that your question makes little sense; if you can
*prove* that it is the speed of a Python dict that limits the performance
of your application, then by all means, go ahead and roll your own.

Stefan Behnel

Apr 10, 2016, 12:41:32 PM
to cython...@googlegroups.com
Yes. And even worse, what your py_sum() code above does is to take a
low-level array of C long values, wrap it up in a Python memory view
wrapper object, pass it into the builtin sum(), which then repeatedly asks
it for the next value through the Python iterator protocol. To comply with
that protocol, the memory view has to wrap a C long value in a Python int
object and pass it back to its caller (sum), which then happily finds out
that it's an integer object, unpacks it and adds its C value to its running
sum.
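(A small illustration of that boxing, my addition rather than Stefan's: a plain CPython `memoryview` over raw bytes shows the same story as the Cython typed memoryview — iterating it hands `sum()` boxed Python ints, not raw C values:)

```python
# A buffer of raw bytes, wrapped in a memoryview object.
data = memoryview(bytearray(range(5)))

# Pulling a value out via the iterator protocol boxes it into a Python int.
first = next(iter(data))
print(type(first))  # <class 'int'>

# sum() sees only those boxed objects, one per element.
print(sum(data))  # 10
```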

That huge protocol overhead is the reason why the difference in the numbers
above is so large. As Yury said, use the right tool for the job.

Note, BTW, that your summing code above doesn't care about overflows.
Python's sum() guards you from that, but a simple C sum does not. If the
value in "s" wraps around, you'll get incorrect results without warning.

Stefan

Ecolss Logan

Apr 11, 2016, 1:42:27 AM
to cython-users, stef...@behnel.de
@Yury @Stefan,

Thank you both :-)



On Monday, April 11, 2016 at 12:41:32 AM UTC+8, Stefan Behnel wrote:

Chris Barker

Apr 11, 2016, 12:30:58 PM
to cython-users
in a way this is a question that has nothing to do with cython, but rather is:

why does numpy exist?

i.e. you can store arrays of numbers in a python list or tuple, and do math on them with sum() and the like, but you notice that it's a lot slower:

In [14]: arr = np.arange(10000)

In [15]: l = [float(i) for i in range(10000)]

In [17]: timeit sum(l)

10000 loops, best of 3: 57.9 µs per loop

In [18]: timeit arr.sum()

The slowest run took 11.12 times longer than the fastest. This could mean that an intermediate result is being cached

100000 loops, best of 3: 5.13 µs per loop

The answer is that numpy arrays exist because they can store homogeneous arrays of values efficiently, directly in memory, and numpy functions can then manipulate these values efficiently -- for instance, a type check only needs to be done once for an entire array operation, rather than for each element.
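(That storage difference is easy to observe directly — a quick sketch of mine, not from the thread: a numpy array stores raw 8-byte values back to back, while a list holds pointers to individually boxed Python objects:)

```python
import sys
import numpy as np

arr = np.arange(1000, dtype=np.float64)
lst = [float(i) for i in range(1000)]

print(arr.itemsize)           # 8: bytes per element, stored contiguously
print(arr.nbytes)             # 8000: the entire numeric payload
print(sys.getsizeof(lst[0]))  # one boxed float object alone is ~24 bytes,
                              # and the list stores a pointer to each one
```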

Cython simply lets you access either regular python data structures or numpy arrays (or other C data structures) and write C code to work with the data. But the real difference you're seeing in your examples is the difference between raw python lists and functions and numpy.

BTW, you could code up a Cython sum() that worked with regular python lists, and I expect it would be about the same as python's built-in sum()

-CHB









--

---
You received this message because you are subscribed to the Google Groups "cython-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cython-users...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris....@noaa.gov

Stefan Behnel

Apr 11, 2016, 3:51:20 PM
to cython...@googlegroups.com
Chris Barker wrote on 11.04.2016 at 18:30:
> BTW, you could code up a Cython sum() that worked with regular python
> lists, and I expect it would be about the same as python's built-in sum()

... except that Python's sum() is really smart and well written. It does
its computation in C space as long as it can (no pun intended) and only
gradually switches to less efficient implementations as the values grow,
with very little overhead.

https://hg.python.org/cpython/file/e3c9a47a83fb/Python/bltinmodule.c#l2179

So you'd already have to implement pretty much the same algorithm to even
come close, and then rely on Cython's fast for-loop implementation and
inlined integer unpacking to *maybe* make a tiny difference. Assuming you
can even use them, given how specialised the original implementation is.
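(One observable consequence of that design — my example, not Stefan's: `sum()` stays exact no matter how large the running total grows, falling back from its C long fast path to arbitrary-precision integers as needed:)

```python
# sum() starts with a C long accumulator and transparently falls back to
# Python's arbitrary-precision ints once the total no longer fits.
small = sum(range(10))    # stays entirely on the C fast path
huge = sum([2**62] * 4)   # overflows a C long; the result is still exact

print(small)  # 45
print(huge)   # 18446744073709551616, i.e. 2**64
```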

Stefan

Chris Barker

Apr 11, 2016, 5:36:03 PM
to cython-users
On Mon, Apr 11, 2016 at 12:51 PM, Stefan Behnel <stef...@behnel.de> wrote:
Chris Barker wrote on 11.04.2016 at 18:30:
> BTW, you could code up a Cython sum() that worked with regular python
> lists, and I expect it would be about the same as python's built-in sum()

> ... except that Python's sum() is really smart and well written.

cool.
 
> It does
> its computation in C space as long as it can (no pun intended) and only
> gradually switches to less efficient implementations as the values grow,
> with very little overhead.
>
> https://hg.python.org/cpython/file/e3c9a47a83fb/Python/bltinmodule.c#l2179
>
> So you'd already have to implement pretty much the same algorithm to even
> come close, and then rely on Cython's fast for-loop implementation and
> inlined integer unpacking to *maybe* make a tiny difference. Assuming you
> can even use them, given how specialised the original implementation is.

this would be an interesting experiment to do..

-CHB


 


Denis Akhiyarov

Apr 13, 2016, 2:29:35 AM
to cython-users, stef...@behnel.de
this looks like a perfect case for JIT!

```
In [1]: import numba as nb

In [2]: @nb.jit(nopython=True)
   ...: def sumr(i):
   ...:     s = 0
   ...:     for x in range(i):
   ...:         s += x
   ...:     return s

In [3]: r1 = sum(range(int(1e6)))

In [4]: r2 = sumr(int(1e6))

In [5]: import numpy as np

In [6]: r3 = np.sum(np.arange(0, int(1e6)))

In [7]: r1 == r2
Out[7]: True

In [8]: r3 == r2
Out[8]: True

In [9]: %timeit r1 = sum(range(int(1e6)))
10 loops, best of 3: 37.3 ms per loop

In [10]: %timeit r2 = sumr(int(1e6))
The slowest run took 17.59 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 592 ns per loop

In [11]: %timeit np.sum(np.arange(0, int(1e6)))
The slowest run took 4.86 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.36 ms per loop
```

Daπid

Apr 13, 2016, 3:13:56 AM
to cython...@googlegroups.com
On 13 April 2016 at 06:33, Denis Akhiyarov <denis.a...@gmail.com> wrote:
> In [11]: %timeit np.sum(np.arange(0,int(1e6)))
> The slowest run took 4.86 times longer than the fastest. This could mean
> that an intermediate result is being cached.
> 100 loops, best of 3: 2.36 ms per loop

Note that there you are also including the time to create the array,
which is about 60% of the total time. The sum alone would take around 1 ms.
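(One way to separate the two costs — my sketch, not from the thread — is to build the array once, outside the timed statement, so only the reduction is measured:)

```python
import numpy as np

n = int(1e6)
a = np.arange(n, dtype=np.int64)  # built once, outside any timing

# In IPython you would then time only the reduction:
#   %timeit np.sum(a)

# Sanity check of the result against Gauss's formula for 0 + 1 + ... + (n-1):
assert np.sum(a) == n * (n - 1) // 2
```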


/David.

Stefan Behnel

Apr 13, 2016, 7:59:38 AM
to cython...@googlegroups.com
Or no time at all, given that compilers recognise loops that simply add up
a series of numbers and generate code for it that uses Gauß's formula to
compute the result in constant time. gcc certainly does. Bad benchmark.

Stefan

Oscar Benjamin

Apr 13, 2016, 9:29:05 AM
to cython-users
Does it? I tried to test this:

$ cat gauss.c
int f(int n)
{
    int total = 0;
    for (int i = 1; i <= n; i++)
    {
        total += i;
    }
    return total;
}

$ gcc -std=c99 -O2 -c gauss.c

$ objdump -d gauss.o

gauss.o: file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <f>:
0: 31 c0 xor %eax,%eax
2: 85 ff test %edi,%edi
4: 7e 13 jle 19 <f+0x19>
6: ba 01 00 00 00 mov $0x1,%edx
b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
10: 01 d0 add %edx,%eax
12: 83 c2 01 add $0x1,%edx
15: 39 d7 cmp %edx,%edi
17: 7d f7 jge 10 <f+0x10>
19: f3 c3 repz retq

Looking at the generated assembly, it seems like the same dumb loop
that I would expect (instructions 10 through 17). How do I tell gcc to
do this kind of optimisation? (I'm running gcc 4.6.3 on Linux.)

--
Oscar

Denis Akhiyarov

Apr 13, 2016, 2:35:41 PM
to cython-users
I agree, this was a really bad late-night benchmark, which was improved in an SO answer:
