O(n log n) medcouple implementation

657 views
Skip to first unread message

Jordi Gutiérrez Hermoso

unread,
Jan 8, 2015, 1:52:48 PM1/8/15
to pystatsmodels
I need to fix the current slow O(n^2) medcouple implementation in
statsmodels. The eventual goal is to have this statistic available
from matplotlib with the fast O(n log n) algorithm. This is very
tricky to do in Python, since the fast version really requires partial
sorting techniques and careful memory management that just isn't easy
to do from Python. I've tried a few implementations in Python, and I
just can't get the speed I need.

I really need C++'s std::nth_element. I don't think there's an
implemenation of this in standard C, and I only see C code in
statsmodels. Can I write C++ code for statsmodels? Would you accept
such a patch?

TIA,
- Jordi G. H.





Nathaniel Smith

unread,
Jan 8, 2015, 1:56:46 PM1/8/15
to pystatsmodels
http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.partition.html ?

-n

--
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

josef...@gmail.com

unread,
Jan 8, 2015, 2:06:39 PM1/8/15
to pystatsmodels
Hi Jordi,

Until now we have neither C nor C++ nor Fortran code in statsmodels. All our C extensions are created using cython, which removes a large number of the error prone gory details of having to maintain c extensions. (It reduces the probability of memory leaks and crashes, and python version compatibility issues)

We would need a very strong incentive to deviate from this.

However, I think having to use cython is not really a strong limitation (for us it became even more useful because we get now also raw LAPACK in it).
A good related example might be our lowess implementation, that was converted from a python to a cython/c version and uses a a bit tricky in place array-memory manipulation.

As Nathaniel just replied. Recent versions of numpy have partial sorting which can be used if a user has a recent enough numpy installed.

Josef

Jordi Gutiérrez Hermoso

unread,
Jan 8, 2015, 2:39:49 PM1/8/15
to pystatsmodels
> Until now we have neither C nor C++ nor Fortran code in statsmodels.

Hm, I thought statsmodels/src/bspline_*.c were being used. Is this
just old cruft?

> All our C extensions are created using cython, which removes a large
> number of the error prone gory details of having to maintain c
> extensions.

Okay, I didn't notice the *.pyx files. I can probably work with Cython
instead, thanks for pointing that out. Is the C++ stdlib available
from Cython? I can't tell this at a glance.

> As Nathaniel just replied. Recent versions of numpy have partial
n> sorting which can be used if a user has a recent enough numpy
> installed.

Hm, it's annnoying when I have to depend on a recent version of numpy.
Numpy is not easy to backport, and if I can't convince my sysadmin to
install a recent enough version to fix my problem right away on our
servers, it may just be easier to use my own C++ implementation.

I'll see what I can do. I need to solve my own problem first, and
later see if I can make you accept the solution that worked for me.

It really is kind of remarkable that there is no fast free
implementation of medcouple except in an R package. They used C there
and relied on the R API instead of the C++ stdlib as I'm planning to.

- Jordi G. H.

P. S.

Google Groups is giving me trouble and is sending emails to my old
jor...@gmail.com, despite my request to send stuff to
jor...@octave.org. Once in Google, always in Google. :-(

When you reply, may I kindly request that you manually CC
jor...@octave.org? Thank you.


josef...@gmail.com

unread,
Jan 8, 2015, 3:06:04 PM1/8/15
to pystatsmodels
On Thu, Jan 8, 2015 at 2:39 PM, Jordi Gutiérrez Hermoso <jor...@octave.org> wrote:
> Until now we have neither C nor C++ nor Fortran code in statsmodels.

Hm, I thought statsmodels/src/bspline_*.c were being used. Is this
just old cruft?

leftover junk

It came from the precursor of statsmodels, it crashed when Skipper and I took over.
Since neither Skipper nor I knew how to debug segfaults in C nor had any interest in it,
these files got deactivated and retired.

The files was kept around initially, in case someone wanted bsplines in c for python.
Nobody ever did, and newer implementations that have been or are being developed use python or cython.

 

> All our C extensions are created using cython, which removes a large
> number of the error prone gory details of having to maintain c
> extensions.

Okay, I didn't notice the *.pyx files. I can probably work with Cython
instead, thanks for pointing that out. Is the C++ stdlib available
from Cython? I can't tell this at a glance.

> As Nathaniel just replied. Recent versions of numpy have partial
n> sorting which can be used if a user has a recent enough numpy
> installed.

Hm, it's annnoying when I have to depend on a recent version of numpy.
Numpy is not easy to backport, and if I can't convince my sysadmin to
install a recent enough version to fix my problem right away on our
servers, it may just be easier to use my own C++ implementation.

I'll see what I can do. I need to solve my own problem first, and
later see if I can make you accept the solution that worked for me.



Based on watching the development of scientific python in the last several years:

Unless you need any of the heavy machinery, stick with simple C and cython and
building and maintaining the functions across all different platforms and compiler versions
will essentially be without problems and extra cost.

I don't know any details, I'm just keeping roughly count about how many problems show up on various mailing lists.

 

It really is kind of remarkable that there is no fast free
implementation of medcouple except in an R package. They used C there
and relied on the R API instead of the C++ stdlib as I'm planning to.

medcouple didn't sound exciting enough to me that it would be a high priority for many python users.


Josef

Kevin Sheppard

unread,
Jan 8, 2015, 8:35:44 PM1/8/15
to pystat...@googlegroups.com
It should be easy to convert a python implementation to Cython and get acceptable performance.  
The core algorithm was implemented in Fortran for the original paper (the algorithm paper, not the , so a C-type version should be simple.

FWIW, the R implementation is GPL, so it can't be used as a basis for anything in statsmodels.

I wrote the medcouple and related functions as a "first commit" - I didn't think anyone would use it for anything really.

josef...@gmail.com

unread,
Jan 8, 2015, 9:36:09 PM1/8/15
to pystatsmodels
They are useful. We haven't done anything yet with them, but we will expand the usage of robust descriptive statistics, eventually.
I found medcouple initially because of questions about robust box and whisker plots, IIRC.

However, my impression (not based on facts) was that medcouple is numerically too expensive for what it delivers.
Maybe an optimized implementation would change that.

Josef

Jordi Gutiérrez Hermoso

unread,
Jan 9, 2015, 10:25:19 AM1/9/15
to pystatsmodels
(Please CC jor...@octave.org when you reply, since Google Groups is
broken for those of us without a Google account.)

> It should be easy to convert a python implementation to Cython and get
> acceptable performance.

Looks like it. I've managed to improve my Python implementation a
little, but I still can't match my C++ implementation. I'll try Cython
soon.

> The core algorithm was implemented in Fortran for the original paper
> (the algorithm paper, not the , so a C-type version should be
> simple.

The X+Y algorithm is actually kind of complicated, and the 1978 paper
has a very dense explanation, or perhaps I'm too stupid for it. It has
taken me several days to untangle it. Do you know where the Fortran
implementation of the X+Y algorithm can be found? I bet it doesn't
even have a free license, though, so perhaps it won't be useful for me
to base work off.

> FWIW, the R implementation is GPL, so it can't be used as a basis
> for anything in statsmodels.

Looks like the lineage of the R implementation is more complicated
than just that. It's very obviously based on the following academic
code:

http://wis.kuleuven.be/stat/robust/Programs/MEDCOUPLE/mc.zip

This code is non-free, since it forbids commercial use. The R package
removes that restriction.

The R package seems to have none of the original authors as
contributors. I'm not sure if the R developers got permission from the
original authors for a free license, but I will assume they did. I
will thus tentatively conclude that the original authors will not
pursue the R developers for applying the GPL to a derivative work for
which they do not own the copyright.

Furthermore, I don't sympathise with the current GPL phobia that is
plaguing free scientific computing. If there is good code and it's
GPLed, there is no reason to not use it. I certainly use GPLed code
commercially all the time, which increases my lack of sympathy. The
GPL is not the bugbear people make it out to be; on the contrary, it's
a very good and useful license.

Having a reference implementation to compare my reimplementation with
is very useful. There is much thought evident in that implementation,
as it handles issues related to speed, overflow, and precision. If
basing my code on the R package means that my reimplementation is
GPLed due to being derivative work, so be it. If statsmodels will keep
using inferior code due to GPL phobia, so be it.

> I wrote the medcouple and related functions as a "first commit" - I
> didn't think anyone would use it for anything really.

Well, I need it now because it turns out that adjusted boxplots work
very well for identifying outliers in my data. It's very important for
us to have very good outlier identification, so the effort of getting
a fast medcouple is worth it.


josef...@gmail.com

unread,
Jan 9, 2015, 10:45:54 AM1/9/15
to pystatsmodels, Jordi Gutiérrez Hermoso
It's not a "phobia", it's "legality".   (Hoping not to get sued is not a very robust approach.)

Since we are (re)distributing code, the GPL license terms won't let us use GPL code.
(Un)fortunately, we are stupid enough to be at the end of the license spectrum where everyone can use our code, be we cannot use theirs.

You can use GPL code commercially, you just cannot (re)distribute it under another license.

It's not a choice of using inferior code in statsmodels, the implementation of different algorithms depends on the availability of code that we are **allowed** to use and redistribute.

 

> I wrote the medcouple and related functions as a "first commit" - I
> didn't think anyone would use it for anything really.

Well, I need it now because it turns out that adjusted boxplots work
very well for identifying outliers in my data. It's very important for
us to have very good outlier identification, so the effort of getting
a fast medcouple is worth it

If you have examples and use cases where different methods for outlier detections work well, then I would be very interested in hearing more.

I was looking into outlier detection and robust estimation for the various models and cases that we have, but it dropped several times from the top of my priority lists.

 
Josef
 

Jordi Gutiérrez Hermoso

unread,
Jan 18, 2015, 9:33:42 PM1/18/15
to pystatsmodels
(Thank you for manually CC'ing me in your replies)

Hello again.

I have managed to implement the fast medcouple in Python:

http://inversethought.com/hg/medcouple/file/9a531463663e/medcouple.py

Sadly, it's still not very fast, so I then translated it as literally
as I could into standard C++:

http://inversethought.com/hg/medcouple/file/9a531463663e/jmedcouple.c%2B%2B

This is the original GPL'ed R implementation in C, here slightly
modified to run with pure standard C++ (no R API):

http://inversethought.com/hg/medcouple/file/9a531463663e/medcouple.c%2B%2B

I am happy to report that my C++ implementation seems to be as fast as
R's C implementation. I also find it a bit easier to read, and
probably within the coming days I'll be writing some expository
material on the O(n log n) algorithm, which took me quite a bit of
work to understand and many, many hours of debugging to get right.[1]

You can clone the entire repo of my work here:

hg clone http://inversethought.com/hg/medcouple

If you cannot be persuaded to use Mercurial, which some of us vastly
prefer to git, just grab a tarball:

http://inversethought.com/hg/medcouple/archive/9a531463663e.tar.gz

This includes some artificial test data, the original mex-file
implementation by the original authors of the medcouple paper
(warning: non-free, forbids commercial use), as well as a slow m-file
implementation that I used during debugging. You can run these on GNU
Octave. A skeletal Makefile is included to ease compilation of the mex
files and the C++ implementations.

Now, as to whether this code can go into statsmodels or not...

I think my code ought to be GPL'ed. I used the R code quite a bit, as
it was my reference implementation, but I also wrote most of the
Python code purely by reading the relevant papers. I did grab some
ideas from the R code for the Python implementation. I am not sure if
it would be fair to say my code is not derivative of the R
implementation if I just remove some of the ideas. Perhaps it would
be, and perhaps I could then fairly say my code is not GPL'ed. I just
don't know.

The safest thing would be to put my code under the GPL. I think there
is some misunderstanding on your part or mine as to what putting
GPL'ed code into statsmodels would entail. It would be true that it
would effectively make all of statsmodels GPL'ed, but I don't think
this would be the tragedy you seem to think it would be. After all,
good Python packages for scientific computing such as Sage are GPL'ed,
and nothing dreadful has happened to them.

At worst, if we cannot come to an agreement on the virtues of the GPL,
you could do what Scipy does with FFTW, and banish the GPL'ed code to
a "contrib" location where the end user does not obtain it unless they
want more performant code and they specifically request it.

HTH,
- Jordi G. H.

[1] The R implementation does some weird tricks in order to
irregularly do 1-based indexing in C, which made it a bit more
difficult to understand. Naturally, my Python and C++ implementations
use 0-based indexing consistently. Going back and forth between the
two kinds of indexing resulted in many difficult to track off-by-ones.





Jordi Gutiérrez Hermoso

unread,
Jan 18, 2015, 9:37:14 PM1/18/15
to josef...@gmail.com, pystatsmodels
On Fri, 2015-01-09 at 10:45 -0500, josef...@gmail.com wrote:
> If you have examples and use cases where different methods for
> outlier detections work well, then I would be very interested in
> hearing more.

I may well soon show you this data, after making sure I have
anonymised it correctly. The adjusted boxplot worked remarkably well
and gave a better idea of what "true" outliers were in most cases. I
feel like my work on the medcouple has been worth it.

- Jordi G. H.


Sturla Molden

unread,
Jan 19, 2015, 12:29:19 AM1/19/15
to pystat...@googlegroups.com
On 08/01/15 19:47, Jordi Gutiérrez Hermoso wrote:

> I really need C++'s std::nth_element.

numpy.partitition does this, albeit with a stronger algorithm than
std::nth_element (NumPy uses introselect instead of quickselect).

If you need C code, just rip introselect out of npysort.

Sturla

josef...@gmail.com

unread,
Jan 19, 2015, 1:24:20 AM1/19/15
to pystatsmodels
There is no misunderstanding  
numpy, scipy, pandas, scikit-learn, statsmodels and so on are BSD licensed and stay that way. matplotlib is Apache, IIRC.  This is one of the BSD/MIT neighborhoods of the open source community, with maybe some Apache and PSF license sprinkled in.
There is no problem copying each others code or moving it between packages nor a problem with commercial usage and licensing.
Changing the license is not up for debate. 
Sage is an island, AFAICS, while we are a medium sized country in the middle of a continent (or, more accurately, we are a bit outside the middle). 

(Usually when I'm browsing for packages, the first thing I check is the license, and if it's not BSD compatible, I stop reading unless they have some interesting documentation.)

 

At worst, if we cannot come to an agreement on the virtues of the GPL,
you could do what Scipy does with FFTW, and banish the GPL'ed code to
a "contrib" location where the end user does not obtain it unless they
want more performant code and they specifically request it.

scipy dropped all build-in support for FFTW and any other BSD incompatible packages, AFAIK.
statsmodels can optionally use CVXOPT, which is GPL, if it is installed on the users machine. That's currently the only optional non-BSD compatible dependency, AFAIK. We don't rule this case out.

Josef

Sturla Molden

unread,
Jan 19, 2015, 1:54:41 AM1/19/15
to pystat...@googlegroups.com
On 19/01/15 07:24, josef...@gmail.com wrote:

> scipy dropped all build-in support for FFTW and any other BSD
> incompatible packages, AFAIK.

MKL is commercial, so that is clearly not the case.

FFTW was dropped from NumPy becuase nobody had time to maintain bindings
for multiple FFT libraries.

Sturla




Alan G Isaac

unread,
Jan 19, 2015, 2:57:03 PM1/19/15
to pystat...@googlegroups.com
On Sun, Jan 18, 2015 at 9:33 PM, Jordi Gutiérrez Hermoso <jor...@octave.org> wrote:
> I think my code ought to be GPL'ed. I used the R code quite a bit


Don't forget that you can always ask the author of GPL'd code for
permission to release any part of it under a BSD license. So if
you would prefer BSD or MIT, at least ask about that.

fwiw,
Alan Isaac

PS This incident illustrates (once again)
how GPL'd code interferes with the kind of sharing we
would hope would dominate academic research. The GPL
only makes sense when there is realistic fear that
others would see a real advantage to producing substantial
unshared improvements to distribute as closed source
*and* would otherwise let the GPL force them to open source
their (just as great) efforts.

Jordi Gutiérrez Hermoso

unread,
Jan 19, 2015, 4:21:53 PM1/19/15
to pystatsmodels
I am very frustrated with the severe misunderstanding most of you are
exhibiting regarding the GPL and software licensing in general. It
appears to me that you are fearing that which you do not understand. I
do not think it will be fruitful to attempt to clear this
misunderstanding. I am thus sorry we could not come to an agreement,
and sorry that you are refusing to even read my code. I would like to
reiterate that you are rejecting perfectly free code due to a strong
cultural inertia of misplaced GPL fears.

Very well, my code will not be accepted into statsmodels. Your choice,
your loss.

On the bright side for you, I plan to write some expository material
over the coming week which will make it easier for anyone to implement
the O(n log n) medcouple algorithm. It was quite a bit of work to
untangle this algorithm from the various papers and the R
implementation. You may consider this the design spec of a future
clean-room reverse engineering of the algorithm. Hopefuly you can read
my description, and from that reimplement the algorithm however you
wish. If you are interested in seeing my exposition once I'm done,
please let me know, and I will notify you.

- Jordi G. H.


Matthew Brett

unread,
Jan 19, 2015, 4:31:55 PM1/19/15
to pystatsmodels
Hi,

On Mon, Jan 19, 2015 at 9:21 PM, Jordi Gutiérrez Hermoso
<jor...@octave.org> wrote:
> I am very frustrated with the severe misunderstanding most of you are
> exhibiting regarding the GPL and software licensing in general. It
> appears to me that you are fearing that which you do not understand. I
> do not think it will be fruitful to attempt to clear this
> misunderstanding.

I realize that these kinds of discussions don't often achieve
anything, but I just wanted to check that you understood that the
statsmodels team would have to change the license of the whole library
just to use your code?

Cheers,

Matthew

Sturla Molden

unread,
Jan 19, 2015, 8:11:48 PM1/19/15
to pystat...@googlegroups.com
On 19/01/15 22:21, Jordi Gutiérrez Hermoso wrote:

> I am very frustrated with the severe misunderstanding most of you are
> exhibiting regarding the GPL and software licensing in general.

I understand.

But if you want to play ball, you have to play by the rules and be a
team player. You are now in the BSD/MIT/Apache corner of the OSS community.

There is nothing that prevents you from releasing your code under the
GPL, you just cannot do it here.

:-)

Sturla

Kevin Sheppard

unread,
Jan 19, 2015, 8:16:15 PM1/19/15
to pystat...@googlegroups.com

Either a gist or a small repo would be a nice way to distribute it.  If you want a challenge, you could see if you can get a numba version which is really nice for end users since it will perform similarly to Cython without any compilation.  In fact, it is simple to copy paste numba code from a gist, unlike a Cython which requires a bit of setup to work. 

Sturla Molden

unread,
Jan 19, 2015, 8:24:10 PM1/19/15
to pystat...@googlegroups.com
On 19/01/15 07:54, Sturla Molden wrote:

> FFTW was dropped from NumPy becuase nobody had time to maintain bindings
> for multiple FFT libraries.

That is: NumPy had code that could optionally use FFTW for FFTs. But it
never had GPL'd code in the NumPy codebase. Also, NumPy was never linked
with FFTW in its official binary installers.

Linux distros, OTOH, could choose to have their packaged version of
NumPy linked with their packaged version of FFTW. It was not GPL that
killed this but lack of developer time for maintenance.

But note the difference: Optionally calling third-party GPL'd code [what
NumPy did] vs. including GPL'd code in the code-base [what was proposed
here].


Sturla


Thomas S

unread,
May 7, 2020, 10:23:26 AM5/7/20
to pystatsmodels
Amazaing! The python implementation should be the default in the statsmodel package!

Op maandag 19 januari 2015 03:33:42 UTC+1 schreef Jordi Gutiérrez Hermoso:
Reply all
Reply to author
Forward
0 new messages