CFFI performance vs Python C extension

Geert Jansen

May 11, 2013, 2:28:19 AM
to pytho...@googlegroups.com
Hi,

I have a C function for framing JSON objects from a stream. This function is written in C for speed. Originally this function was exposed via a python extension module. I have been doing an experiment where I exposed this function via CFFI. My goal is to have one high performance function that I can share between CPython and PyPy. The results are below:

Python (2.7.3):
 - CFFI: 9.96 MiB/sec
 - extension module: 84.71 MiB/sec

For comparison I also ran the same benchmark with PyPy. To use the C module I am using cpyext. The results are:

PyPy (2.0):
 - CFFI: 17.12 MiB/sec
 - cpyext: 0.84 MiB/sec

The benchmark is here:


Any idea why CFFI on CPython is still so much slower than a native extension module?

Regards,
Geert


Maciej Fijalkowski

May 11, 2013, 11:37:09 AM
to pytho...@googlegroups.com
hey

Two things:

a) you can try newer pypy
b) on pypy you can try writing this function in Python, should be faster

Cheers,
fijal

_k...@yahoo.com

May 11, 2013, 2:17:52 PM
to pytho...@googlegroups.com


On Saturday, May 11, 2013 8:28:19 AM UTC+2, Geert Jansen wrote:
> Hi,
>
> I have a C function for framing JSON objects from a stream. This function is written in C for speed. Originally this function was exposed via a python extension module. I have been doing an experiment where I exposed this function via CFFI. My goal is to have one high performance function that I can share between CPython and PyPy. The results are below:

Just a thought:

def cffi_split(buf, offset=0, state=0):
    c_buf = ffi.new('char[]', buf)
    c_offset = ffi.new('int *', offset)
    c_state = ffi.new('int *', state)
    ret = lib.split_json(c_buf, len(buf), c_offset, c_state)
    return (ret, c_offset[0], c_state[0])

Every invocation of this function allocates three buffers, while the
three buffers from the previous run are freed. So you're maybe comparing
apples with pears - all this extra work isn't done elsewhere. Try
allocating the buffers once and only loop over the lib.split_json call.
If the call to lib.split_json takes little time, the
allocation/deallocation might contribute a lot to the total processing
time.

Kay

Geert Jansen

May 11, 2013, 3:49:41 PM
to pytho...@googlegroups.com
On Sat, May 11, 2013 at 5:37 PM, Maciej Fijalkowski <fij...@gmail.com> wrote:
> hey
>
> Two things:
>
> a) you can try newer pypy
> b) on pypy you can try writing this function in Python, should be faster

This is already on PyPy 2.0. Regarding b), I am looking for a way to use an already optimized C function in an efficient way from PyPy and CPython. I hoped CFFI would offer me that. Why should re-coding it in Python help? The JIT can do a good job but why should it be faster than compiled and hand optimized C code?

Regards,
Geert

Geert Jansen

May 11, 2013, 3:52:25 PM
to pytho...@googlegroups.com
Hi,

On Sat, May 11, 2013 at 8:17 PM, <_k...@yahoo.com> wrote:

[...]

> every invocation of this function allocates three buffers, while three
> buffers from the previous run are freed. So you're maybe comparing
> apples with pears - all this extra work isn't done elsewhere. Try
> allocating the buffers once and only loop over the lib.split_json call.
> If the call to lib.split_json takes little time, the
> allocation/deallocation might contribute a lot to the total processing
> time.

Thanks Kay. I actually did check this by caching c_buf, and it made a bit of difference (roughly 30%). But as-is the code is 8x slower when called via CFFI. So even after this modest speedup there is an almost order-of-magnitude difference in performance.

Regards,
Geert

Armin Rigo

May 12, 2013, 2:23:33 AM
to pytho...@googlegroups.com
Hi Geert,

I'm unsure, but I suspect that this is the problem:

> c_buf = ffi.new('char[]', buf)

It creates a char[] buffer and copies the string "buf" into it. If I
see it correctly, the CPython extension module doesn't do any copying.
I guess the benchmark takes a big string and calls split_json() a lot of times
to get various positions in this string. Then it would be a much
better idea to copy the big string only once. (This can have an
arbitrarily bad O(n^2) effect if you're going to call split_json O(n)
times on a string of length O(n).)
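
To put rough numbers on the quadratic effect described above, here is a small back-of-the-envelope model. It is only a sketch: `bytes_copied`, the 1 MiB buffer and the 64-byte frame size are hypothetical values chosen for illustration, not figures from the benchmark, and it ignores the extra terminator byte that a char[] copy adds.

```python
def bytes_copied(total_len, frame_len, copy_each_call):
    """Count bytes copied while framing a buffer of total_len bytes.

    Assumes every frame is frame_len bytes long, so split_json would be
    called once per frame. Both sizes are hypothetical.
    """
    calls = total_len // frame_len
    if copy_each_call:
        # ffi.new('char[]', buf) copies the whole buffer on every call.
        return calls * total_len
    # Copy the buffer once up front and reuse it for every call.
    return total_len


MiB = 1024 * 1024
per_call = bytes_copied(MiB, 64, copy_each_call=True)
once = bytes_copied(MiB, 64, copy_each_call=False)

# The overhead factor equals the number of calls.
print(per_call // once)  # 16384
```

Under these assumptions, copying per call moves 16384 times as many bytes as copying once, and the factor grows with the buffer size.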


A bientôt,

Armin.

_k...@yahoo.com

May 13, 2013, 5:08:16 AM
to pytho...@googlegroups.com
It's not only about c_buf. Every single call to ffi.new results in a call to malloc, even if you only allocate a few bytes for an int, and when the object is no longer referenced, the memory is released by a call to free. To really remove these effects, you should avoid all calls to ffi.new in your inner loop. Judging from what you report about removing one of them, removing the other two calls to ffi.new might therefore save you another 60%. How many bytes malloc allocates or free releases is largely irrelevant: the calls take about the same time regardless.
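
One way to check a claim like this on your own machine is a small timing harness. The sketch below uses only the standard library; `per_call_microseconds` is a hypothetical helper, and the two functions are stand-ins that use a `bytearray` instead of cdata objects, so the numbers only illustrate the method. Swap in the real wrappers to measure CFFI itself.

```python
import timeit


def per_call_microseconds(fn, number=200000):
    """Return the average wall-clock cost of one call to fn, in microseconds."""
    return timeit.timeit(fn, number=number) / number * 1e6


# Stand-in for a slot that is allocated once and reused.
reused = bytearray(4)


def allocate_each_call():
    slot = bytearray(4)  # fresh allocation on every call
    slot[0] = 1
    return slot


def reuse_slot():
    reused[0] = 1  # same object every time, no allocation
    return reused


print("allocate: %.3f us/call" % per_call_microseconds(allocate_each_call))
print("reuse:    %.3f us/call" % per_call_microseconds(reuse_slot))
```

The absolute figures depend on the interpreter and the system, so they are best compared only against each other.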

Kay

Geert Jansen

May 14, 2013, 5:05:43 AM
to pytho...@googlegroups.com
Hi Armin,
OK, this was the problem, together with the problem that _kfj identified, where I was allocating two 'int *' cdata objects per call. Creating even a small cdata object appears to be expensive. I've updated my code: it now pre-allocates the int arrays, and instead of creating the buffer it passes the Python string directly. The new numbers are:

$ python test_speed.py 
Speed for C extension: 95.79 MiB/sec
Speed for CFFI: 59.03 MiB/sec

So it's much closer, although there is still a small difference. (The absolute number for the C extension changed because this is a different system.) For my purposes I think this is close enough.
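
For readers following along, the change described above might look roughly like the sketch below. It is not the actual updated benchmark code. It assumes `ffi` and `lib` are the CFFI objects from the benchmark and that `lib.split_json` takes `(char *, int, int *, int *)`, as implied by the wrapper quoted earlier in the thread; `make_cffi_split` is a hypothetical helper name.

```python
def make_cffi_split(ffi, lib):
    """Build a split function that reuses its int slots across calls.

    ffi and lib are assumed to be the cffi FFI instance and the loaded
    library; lib.split_json is assumed to match the wrapper quoted above.
    """
    # Allocate the two 'int *' slots once, outside the hot path.
    c_offset = ffi.new('int *')
    c_state = ffi.new('int *')

    def cffi_split(buf, offset=0, state=0):
        # Reset the reused slots instead of creating new cdata objects.
        c_offset[0] = offset
        c_state[0] = state
        # Pass the byte string directly for the 'char *' argument,
        # so no char[] copy of the buffer is made.
        ret = lib.split_json(buf, len(buf), c_offset, c_state)
        return (ret, c_offset[0], c_state[0])

    return cffi_split
```

With the real objects this would be used as `cffi_split = make_cffi_split(ffi, lib)` followed by `cffi_split(data)` in the loop. Because the two slots are shared between calls, the returned function is not safe to call from several threads at once, and on Python 3 `buf` has to be a `bytes` object.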

Regards,
Geert