Call qsort from cython code .... with an inlined comparison function ?!

308 views
Skip to first unread message

Nathann Cohen

unread,
Nov 23, 2014, 11:33:46 AM11/23/14
to Sage devel
Hello everybody,

I wrote a bruteforce Cython code recently (#17309) which spends most of its time on calls to qsort.

This is normal, sorting is sort of the most expensive thing I do, but to call qsort you need to provide a comparison function. Now, as qsort is compiled in a library, the comparison function cannot be inlined inside of qsort and I suspect that it has a nontrivial cost (given how simple the comparison is).

Thus, if I copy/paste the original code of qsort into my Cython file the code should be faster, only that is ugly.

Sooooooo... Do you know if there ais  way to re-compile my code along with the code of qsort without having to copy/paste it ?

Thanks,

Nathann

William Stein

unread,
Nov 23, 2014, 11:44:58 AM11/23/14
to sage-devel
On Sun, Nov 23, 2014 at 8:33 AM, Nathann Cohen <nathan...@gmail.com> wrote:
> Hello everybody,
>
> I wrote a bruteforce Cython code recently (#17309) which spends most of its
> time on calls to qsort.
>
> This is normal, sorting is sort of the most expensive thing I do, but to
> call qsort you need to provide a comparison function. Now, as qsort is
> compiled in a library, the comparison function cannot be inlined inside of
> qsort and I suspect that it has a nontrivial cost (given how simple the
> comparison is).
>
> Thus, if I copy/paste the original code of qsort into my Cython file the
> code should be faster, only that is ugly.

Is it faster? You could at least do the ugly copy/paste (or #include
so it can stay C code) and find out what the speedup actually is.
Only if there is a real speedup would you then follow up on this
question.

William

>
> Sooooooo... Do you know if there ais way to re-compile my code along with
> the code of qsort without having to copy/paste it ?
>
> Thanks,
>
> Nathann
>
> --
> You received this message because you are subscribed to the Google Groups
> "sage-devel" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to sage-devel+...@googlegroups.com.
> To post to this group, send email to sage-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/sage-devel.
> For more options, visit https://groups.google.com/d/optout.



--
William Stein
Professor of Mathematics
University of Washington
http://wstein.org

Volker Braun

unread,
Nov 23, 2014, 12:07:06 PM11/23/14
to sage-...@googlegroups.com
Did you profile your code on the C-level? e.g. using gprof? As a rule of thumb, guesses about where the bottleneck is are wrong :-)  Its entirely conceivable that branch prediction and speculative execution solve this already for you.

C++ std::sort will be able to inline the comparator.

Link-time optimization (e.g. gcc -flto) can in principle also inline on the level of object code, after the compilation did not inline (because of different compilation units, say).Though the libc sort is in a shared library.

Nathann Cohen

unread,
Nov 23, 2014, 12:46:04 PM11/23/14
to Sage devel
Yo !

> Did you profile your code on the C-level? e.g. using gprof? As a rule of thumb, guesses about where the bottleneck is are wrong :-) Its entirely conceivable that branch prediction and speculative execution solve this already for you.

Yeees daddy :-P

I did that like a grown man with perf record :-P

> C++ std::sort will be able to inline the comparator.

Oh cool ! Thanks I will try to call that from Cython.

> Link-time optimization (e.g. gcc -flto) can in principle also inline on the level of object code, after the compilation did not inline (because of different compilation units, say).Though the libc sort is in a shared library.

Oh... So when does that work ? I mean, if you compile everything from
source already there is nothing to optimize at link time, is there ?

Nathann

Thierry Dumont

unread,
Nov 23, 2014, 1:05:04 PM11/23/14
to sage-...@googlegroups.com
Le 23/11/2014 18:07, Volker Braun a écrit :
> Did you profile your code on the C-level? e.g. using gprof? As a rule of
> thumb, guesses about where the bottleneck is are wrong :-) Its entirely
> conceivable that branch prediction and speculative execution solve this
> already for you.
>

Is gprof enough powerful with modern architectures on such programs?
from my point of view, no.
There are non free, commercial, tools like vtune which can do fantastic
measurement job. Vtune shows, for example, that a call to std::copy is
not as fast as a for loop, which is turned by the compiler in a memcopy
(probably std::copy is not!). I do not think we can see this with gprof.
But ok, you are not supposed to buy vtune...

What about likwid https://code.google.com/p/likwid ? It is free. Did
somebody used it to measure cython code performances?

Likwid (and Vtune) have in common to use performance counters on Intel
and AMD processors (not sure for AMD with Vtune...).

What is the size of what you are sorting ? If it is small enough to fit
in the caches, and better in the L1 cache, you can possibly improve
something with your modification, but otherwise it is certainly memory
bounded and you cannot do much...
You have to measure the bandwidth of your program. Vtune does this,
possibly likwid too.

t.d.
> C++ std::sort will be able to inline the comparator.
>
> Link-time optimization (e.g. gcc -flto) can in principle also inline on
> the level of object code, after the compilation did not inline (because
> of different compilation units, say).Though the libc sort is in a shared
> library.
>
>
> On Sunday, November 23, 2014 4:33:46 PM UTC, Nathann Cohen wrote:
>
> Hello everybody,
>
> I wrote a bruteforce Cython code recently (#17309) which spends most
> of its time on calls to qsort.
>
> This is normal, sorting is sort of the most expensive thing I do,
> but to call qsort you need to provide a comparison function. Now, as
> qsort is compiled in a library, the comparison function cannot be
> inlined inside of qsort and I suspect that it has a nontrivial cost
> (given how simple the comparison is).
>
> Thus, if I copy/paste the original code of qsort into my Cython file
> the code should be faster, only that is ugly.
>
> Sooooooo... Do you know if there ais way to re-compile my code
> along with the code of qsort without having to copy/paste it ?
>
> Thanks,
>
> Nathann
>
> --
> You received this message because you are subscribed to the Google
> Groups "sage-devel" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sage-devel+...@googlegroups.com
> <mailto:sage-devel+...@googlegroups.com>.
> To post to this group, send email to sage-...@googlegroups.com
> <mailto:sage-...@googlegroups.com>.
tdumont.vcf

Francesco Biscani

unread,
Nov 23, 2014, 1:08:43 PM11/23/14
to sage-...@googlegroups.com
On 23 November 2014 at 18:07, Volker Braun <vbrau...@gmail.com> wrote:
C++ std::sort will be able to inline the comparator.

+1

std::sort() will do exactly what you describe, only in a type-safe and compiler-checked automatic way. 

Nathann Cohen

unread,
Nov 23, 2014, 1:09:35 PM11/23/14
to Sage devel
Hello !

> What about likwid https://code.google.com/p/likwid ? It is free. Did
> somebody used it to measure cython code performances?

Never tried vtune, nor likwid.

> What is the size of what you are sorting ? If it is small enough to fit in
> the caches, and better in the L1 cache, you can possibly improve something
> with your modification, but otherwise it is certainly memory bounded and you
> cannot do much...

I sort many small arrays. Several kB, not more.

> You have to measure the bandwidth of your program. Vtune does this, possibly
> likwid too.

Oh. Do you think that it can be used from outside the program, i.e. on
a running Sage session ?

Nathann

Francesco Biscani

unread,
Nov 23, 2014, 1:11:10 PM11/23/14
to sage-...@googlegroups.com
On 23 November 2014 at 19:05, Thierry Dumont <tdu...@math.univ-lyon1.fr> wrote:
Is gprof enough powerful with modern architectures on such programs? from my point of view, no.
There are non free, commercial, tools like vtune which can do fantastic measurement job. Vtune shows, for example, that a call to std::copy is not as fast as a for loop, which is turned by the compiler in a memcopy (probably std::copy is not!). I do not think we can see this with gprof.
But ok, you are not supposed to buy vtune...

I would be surprised if any modern c++ library implementation does not have specialisations of std::copy for POD types that use memcpy() or some other trick.
 
What about likwid https://code.google.com/p/likwid ? It is free. Did somebody used it to measure cython code performances?

Likwid (and Vtune) have in common to use  performance counters on Intel and AMD processors (not sure for AMD with Vtune...).

What is the size of what you are sorting ? If it is small enough to fit in the caches, and better in the L1 cache, you can possibly improve something with your modification, but otherwise it is certainly memory bounded and you cannot do much...
You have to measure the bandwidth of your program. Vtune does this, possibly likwid too.

I used callgrind() in the past with some success... I would like to try the google cpu profiler to see how it fares, but I haven't had the chance yet.

Christopher Swenson

unread,
Nov 23, 2014, 1:29:20 PM11/23/14
to sage-...@googlegroups.com

There's a sort.h library you should be able to include that will have quick sort + many others, and will allow the compiler to properly inline functions. https://github.com/swenson/sort (disclaimer: I wrote it). I get about a 10x speed improvement over qsort just for the ability to inline functions.

I've never used it with Cython, but I don't believe there should be any problems doing so.

--
You received this message because you are subscribed to the Google Groups "sage-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sage-devel+...@googlegroups.com.
To post to this group, send email to sage-...@googlegroups.com.

Thierry Dumont

unread,
Nov 23, 2014, 2:28:04 PM11/23/14
to sage-...@googlegroups.com
Le 23/11/2014 19:07, Francesco Biscani a écrit :
> On 23 November 2014 at 18:07, Volker Braun <vbrau...@gmail.com
> <mailto:vbrau...@gmail.com>> wrote:
>
> C++ std::sort will be able to inline the comparator.
>
>
Just look at the assembly code:-)

> +1
>
> std::sort() will do exactly what you describe, only in a type-safe and
> compiler-checked automatic way.
>
> --
> You received this message because you are subscribed to the Google
> Groups "sage-devel" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to sage-devel+...@googlegroups.com
> <mailto:sage-devel+...@googlegroups.com>.
> To post to this group, send email to sage-...@googlegroups.com
> <mailto:sage-...@googlegroups.com>.
tdumont.vcf

Thierry Dumont

unread,
Nov 23, 2014, 2:41:25 PM11/23/14
to sage-...@googlegroups.com
With Vtune it is possible (you start and stop collection with a call to
the library (do not imagine I am paid by Intel :-) ,I am not.). With
likwid, I don t know.
But, ok, it seems your objects can stay in the L1 cache. In that case,
there is a possibility to improve performances by doing what you propose
to do. I had recently a problem like yours: an ode solver needs to
solve linear systems, and I got a slight improvement by replacing the
call to lapack by my own routine, probably because the lapack routine I
called (which is called at many places in the program) imposes a "far"
branch at each call... and for small systems (size 5), it is penalizing.
But what about the quick sort? is it sure that the implementation cannot
degenerate? it is well known all the efficiency can be lost if the "key"
used for partition is not chosen as it should be... What about
replacing the quick sort by an other method ? (the tree based one?).

t.d.

>
> Nathann
>

tdumont.vcf

Jeroen Demeyer

unread,
Nov 23, 2014, 2:54:04 PM11/23/14
to sage-...@googlegroups.com
On 2014-11-23 19:05, Thierry Dumont wrote:
> Vtune shows, for example, that a call to std::copy is
> not as fast as a for loop, which is turned by the compiler in a memcopy
> (probably std::copy is not!).
If that's the case, get a better C++ compiler.

Francesco Biscani

unread,
Nov 23, 2014, 3:30:13 PM11/23/14
to sage-...@googlegroups.com
On 23 November 2014 at 20:41, Thierry Dumont <tdu...@math.univ-lyon1.fr> wrote:
But what about the quick sort? is it sure that the implementation cannot degenerate? it is well known all the efficiency can be lost if the "key" used for partition is not chosen  as it should be... What about replacing the quick sort by an other method ? (the tree based one?).

qsort is not necessarily quick sort. AFAIK, the C/C++ libraries only guarantee that the sorting has logarithmic complexity in the number of comparisons.

std::sort is often implemented as a mix of introsort and other sorting algorithms, with special casing for small/big objects, small/large number of items to be sorted, etc.

  F.

Thierry Dumont

unread,
Nov 23, 2014, 3:53:50 PM11/23/14
to sage-...@googlegroups.com
heum... it is icc :-)
A c++ compiler as some unpredicable aspects...
tdumont.vcf

Francesco Biscani

unread,
Nov 23, 2014, 4:00:30 PM11/23/14
to sage-...@googlegroups.com
icc is a pretty garbage C++ compiler, unless you are doing exactly the type of linear algebra operations that they optimise to death (on Intel processors at least :) for benchmarketing.

--
You received this message because you are subscribed to the Google Groups "sage-devel" group.
To unsubscribe from this group and stop receiving emails from it, send an email to sage-devel+unsubscribe@googlegroups.com.
To post to this group, send email to sage-...@googlegroups.com.
Reply all
Reply to author
Forward
0 new messages