| O(n log n) medcouple implementation | Jordi Gutiérrez Hermoso | 08/01/15 10:52 | I need to fix the current slow O(n^2) medcouple implementation in
statsmodels. The eventual goal is to have this statistic available from matplotlib with the fast O(n log n) algorithm. This is very tricky to do in Python, since the fast version really requires partial sorting techniques and careful memory management that just isn't easy to do from Python. I've tried a few implementations in Python, and I just can't get the speed I need. I really need C++'s std::nth_element. I don't think there's an implemenation of this in standard C, and I only see C code in statsmodels. Can I write C++ code for statsmodels? Would you accept such a patch? TIA, - Jordi G. H. |
| Re: [pystatsmodels] O(n log n) medcouple implementation | Nathaniel Smith | 08/01/15 10:56 | http://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.partition.html ?
-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org |
| Re: [pystatsmodels] O(n log n) medcouple implementation | josefpktd | 08/01/15 11:06 | Hi Jordi, Until now we have neither C nor C++ nor Fortran code in statsmodels. All our C extensions are created using cython, which removes a large number of the error prone gory details of having to maintain c extensions. (It reduces the probability of memory leaks and crashes, and python version compatibility issues) We would need a very strong incentive to deviate from this. However, I think having to use cython is not really a strong limitation (for us it became even more useful because we get now also raw LAPACK in it). A good related example might be our lowess implementation, that was converted from a python to a cython/c version and uses a a bit tricky in place array-memory manipulation. As Nathaniel just replied. Recent versions of numpy have partial sorting which can be used if a user has a recent enough numpy installed. Josef
|
| Re: O(n log n) medcouple implementation | Jordi Gutiérrez Hermoso | 08/01/15 11:39 | > Until now we have neither C nor C++ nor Fortran code in statsmodels.Hm, I thought statsmodels/src/bspline_*.c were being used. Is this just old cruft? Okay, I didn't notice the *.pyx files. I can probably work with Cython instead, thanks for pointing that out. Is the C++ stdlib available from Cython? I can't tell this at a glance. n> sorting which can be used if a user has a recent enough numpy > installed. Hm, it's annnoying when I have to depend on a recent version of numpy. Numpy is not easy to backport, and if I can't convince my sysadmin to install a recent enough version to fix my problem right away on our servers, it may just be easier to use my own C++ implementation. I'll see what I can do. I need to solve my own problem first, and later see if I can make you accept the solution that worked for me. It really is kind of remarkable that there is no fast free implementation of medcouple except in an R package. They used C there and relied on the R API instead of the C++ stdlib as I'm planning to. - Jordi G. H. P. S. Google Groups is giving me trouble and is sending emails to my old jor...@gmail.com, despite my request to send stuff to jor...@octave.org. Once in Google, always in Google. :-( When you reply, may I kindly request that you manually CC jor...@octave.org? Thank you. |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | josefpktd | 08/01/15 12:06 | On Thu, Jan 8, 2015 at 2:39 PM, Jordi Gutiérrez Hermoso <jor...@octave.org> wrote:> Until now we have neither C nor C++ nor Fortran code in statsmodels. leftover junk It came from the precursor of statsmodels, it crashed when Skipper and I took over. Since neither Skipper nor I knew how to debug segfaults in C nor had any interest in it, these files got deactivated and retired. The files was kept around initially, in case someone wanted bsplines in c for python. Nobody ever did, and newer implementations that have been or are being developed use python or cython.
Based on watching the development of scientific python in the last several years: Unless you need any of the heavy machinery, stick with simple C and cython and building and maintaining the functions across all different platforms and compiler versions will essentially be without problems and extra cost. I don't know any details, I'm just keeping roughly count about how many problems show up on various mailing lists.
medcouple didn't sound exciting enough to me that it would be a high priority for many python users. Josef
|
| Re: O(n log n) medcouple implementation | Kevin Sheppard | 08/01/15 17:35 | It should be easy to convert a python implementation to Cython and get acceptable performance. The core algorithm was implemented in Fortran for the original paper (the algorithm paper, not the , so a C-type version should be simple. FWIW, the R implementation is GPL, so it can't be used as a basis for anything in statsmodels. I wrote the medcouple and related functions as a "first commit" - I didn't think anyone would use it for anything really. |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | josefpktd | 08/01/15 18:36 | They are useful. We haven't done anything yet with them, but we will expand the usage of robust descriptive statistics, eventually. I found medcouple initially because of questions about robust box and whisker plots, IIRC. However, my impression (not based on facts) was that medcouple is numerically too expensive for what it delivers. Maybe an optimized implementation would change that. Josef |
| Re: O(n log n) medcouple implementation | Jordi Gutiérrez Hermoso | 09/01/15 07:25 | (Please CC jor...@octave.org when you reply, since Google Groups is
broken for those of us without a Google account.) Looks like it. I've managed to improve my Python implementation a little, but I still can't match my C++ implementation. I'll try Cython soon. The X+Y algorithm is actually kind of complicated, and the 1978 paper has a very dense explanation, or perhaps I'm too stupid for it. It has taken me several days to untangle it. Do you know where the Fortran implementation of the X+Y algorithm can be found? I bet it doesn't even have a free license, though, so perhaps it won't be useful for me to base work off. Looks like the lineage of the R implementation is more complicated than just that. It's very obviously based on the following academic code: http://wis.kuleuven.be/stat/robust/Programs/MEDCOUPLE/mc.zip This code is non-free, since it forbids commercial use. The R package removes that restriction. The R package seems to have none of the original authors as contributors. I'm not sure if the R developers got permission from the original authors for a free license, but I will assume they did. I will thus tentatively conclude that the original authors will not pursue the R developers for applying the GPL to a derivative work for which they do not own the copyright. Furthermore, I don't sympathise with the current GPL phobia that is plaguing free scientific computing. If there is good code and it's GPLed, there is no reason to not use it. I certainly use GPLed code commercially all the time, which increases my lack of sympathy. The GPL is not the bugbear people make it out to be; on the contrary, it's a very good and useful license. Having a reference implementation to compare my reimplementation with is very useful. There is much thought evident in that implementation, as it handles issues related to speed, overflow, and precision. If basing my code on the R package means that my reimplementation is GPLed due to being derivative work, so be it. If statsmodels will keep using inferior code due to GPL phobia, so be it. Well, I need it now because it turns out that adjusted boxplots work very well for identifying outliers in my data. It's very important for us to have very good outlier identification, so the effort of getting a fast medcouple is worth it. |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | josefpktd | 09/01/15 07:45 | It's not a "phobia", it's "legality". (Hoping not to get sued is not a very robust approach.) Since we are (re)distributing code, the GPL license terms won't let us use GPL code. (Un)fortunately, we are stupid enough to be at the end of the license spectrum where everyone can use our code, be we cannot use theirs. You can use GPL code commercially, you just cannot (re)distribute it under another license. It's not a choice of using inferior code in statsmodels, the implementation of different algorithms depends on the availability of code that we are **allowed** to use and redistribute.
If you have examples and use cases where different methods for outlier detections work well, then I would be very interested in hearing more. I was looking into outlier detection and robust estimation for the various models and cases that we have, but it dropped several times from the top of my priority lists. Josef |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | Jordi Gutiérrez Hermoso | 18/01/15 18:33 | (Thank you for manually CC'ing me in your replies)
Hello again. I have managed to implement the fast medcouple in Python: http://inversethought.com/hg/medcouple/file/9a531463663e/medcouple.py Sadly, it's still not very fast, so I then translated it as literally as I could into standard C++: http://inversethought.com/hg/medcouple/file/9a531463663e/jmedcouple.c%2B%2B This is the original GPL'ed R implementation in C, here slightly modified to run with pure standard C++ (no R API): http://inversethought.com/hg/medcouple/file/9a531463663e/medcouple.c%2B%2B I am happy to report that my C++ implementation seems to be as fast as R's C implementation. I also find it a bit easier to read, and probably within the coming days I'll be writing some expository material on the O(n log n) algorithm, which took me quite a bit of work to understand and many, many hours of debugging to get right.[1] You can clone the entire repo of my work here: hg clone http://inversethought.com/hg/medcouple If you cannot be persuaded to use Mercurial, which some of us vastly prefer to git, just grab a tarball: http://inversethought.com/hg/medcouple/archive/9a531463663e.tar.gz This includes some artificial test data, the original mex-file implementation by the original authors of the medcouple paper (warning: non-free, forbids commercial use), as well as a slow m-file implementation that I used during debugging. You can run these on GNU Octave. A skeletal Makefile is included to ease compilation of the mex files and the C++ implementations. Now, as to whether this code can go into statsmodels or not... I think my code ought to be GPL'ed. I used the R code quite a bit, as it was my reference implementation, but I also wrote most of the Python code purely by reading the relevant papers. I did grab some ideas from the R code for the Python implementation. I am not sure if it would be fair to say my code is not derivative of the R implementation if I just remove some of the ideas. Perhaps it would be, and perhaps I could then fairly say my code is not GPL'ed. I just don't know. The safest thing would be to put my code under the GPL. I think there is some misunderstanding on your part or mine as to what putting GPL'ed code into statsmodels would entail. It would be true that it would effectively make all of statsmodels GPL'ed, but I don't think this would be the tragedy you seem to think it would be. After all, good Python packages for scientific computing such as Sage are GPL'ed, and nothing dreadful has happened to them. At worst, if we cannot come to an agreement on the virtues of the GPL, you could do what Scipy does with FFTW, and banish the GPL'ed code to a "contrib" location where the end user does not obtain it unless they want more performant code and they specifically request it. HTH, - Jordi G. H. [1] The R implementation does some weird tricks in order to irregularly do 1-based indexing in C, which made it a bit more difficult to understand. Naturally, my Python and C++ implementations use 0-based indexing consistently. Going back and forth between the two kinds of indexing resulted in many difficult to track off-by-ones. |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | Jordi Gutiérrez Hermoso | 18/01/15 18:37 | On Fri, 2015-01-09 at 10:45 -0500, josef...@gmail.com wrote:I may well soon show you this data, after making sure I have anonymised it correctly. The adjusted boxplot worked remarkably well and gave a better idea of what "true" outliers were in most cases. I feel like my work on the medcouple has been worth it. - Jordi G. H. |
| Re: O(n log n) medcouple implementation | Sturla Molden | 18/01/15 21:29 | On 08/01/15 19:47, Jordi Gutiérrez Hermoso wrote:numpy.partitition does this, albeit with a stronger algorithm than std::nth_element (NumPy uses introselect instead of quickselect). If you need C code, just rip introselect out of npysort. Sturla |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | josefpktd | 18/01/15 22:24 | There is no misunderstanding numpy, scipy, pandas, scikit-learn, statsmodels and so on are BSD licensed and stay that way. matplotlib is Apache, IIRC. This is one of the BSD/MIT neighborhoods of the open source community, with maybe some Apache and PSF license sprinkled in. There is no problem copying each others code or moving it between packages nor a problem with commercial usage and licensing. Changing the license is not up for debate. Sage is an island, AFAICS, while we are a medium sized country in the middle of a continent (or, more accurately, we are a bit outside the middle). (Usually when I'm browsing for packages, the first thing I check is the license, and if it's not BSD compatible, I stop reading unless they have some interesting documentation.)
scipy dropped all build-in support for FFTW and any other BSD incompatible packages, AFAIK. statsmodels can optionally use CVXOPT, which is GPL, if it is installed on the users machine. That's currently the only optional non-BSD compatible dependency, AFAIK. We don't rule this case out. Josef
|
| Re: O(n log n) medcouple implementation | Sturla Molden | 18/01/15 22:54 | On 19/01/15 07:24, josef...@gmail.com wrote:MKL is commercial, so that is clearly not the case. FFTW was dropped from NumPy becuase nobody had time to maintain bindings for multiple FFT libraries. Sturla |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | Alan G Isaac | 19/01/15 11:57 | On Sun, Jan 18, 2015 at 9:33 PM, Jordi Gutiérrez Hermoso <jor...@octave.org> wrote: > I think my code ought to be GPL'ed. I used the R code quite a bitDon't forget that you can always ask the author of GPL'd code for permission to release any part of it under a BSD license. So if you would prefer BSD or MIT, at least ask about that. fwiw, Alan Isaac PS This incident illustrates (once again) how GPL'd code interferes with the kind of sharing we would hope would dominate academic research. The GPL only makes sense when there is realistic fear that others would see a real advantage to producing substantial unshared improvements to distribute as closed source *and* would otherwise let the GPL force them to open source their (just as great) efforts. |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | Jordi Gutiérrez Hermoso | 19/01/15 13:21 | I am very frustrated with the severe misunderstanding most of you are
exhibiting regarding the GPL and software licensing in general. It appears to me that you are fearing that which you do not understand. I do not think it will be fruitful to attempt to clear this misunderstanding. I am thus sorry we could not come to an agreement, and sorry that you are refusing to even read my code. I would like to reiterate that you are rejecting perfectly free code due to a strong cultural inertia of misplaced GPL fears. Very well, my code will not be accepted into statsmodels. Your choice, your loss. On the bright side for you, I plan to write some expository material over the coming week which will make it easier for anyone to implement the O(n log n) medcouple algorithm. It was quite a bit of work to untangle this algorithm from the various papers and the R implementation. You may consider this the design spec of a future clean-room reverse engineering of the algorithm. Hopefuly you can read my description, and from that reimplement the algorithm however you wish. If you are interested in seeing my exposition once I'm done, please let me know, and I will notify you. - Jordi G. H. |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | Matthew Brett | 19/01/15 13:31 | Hi,
I realize that these kinds of discussions don't often achieve anything, but I just wanted to check that you understood that the statsmodels team would have to change the license of the whole library just to use your code? Cheers, Matthew |
| Re: O(n log n) medcouple implementation | Sturla Molden | 19/01/15 17:11 | On 19/01/15 22:21, Jordi Gutiérrez Hermoso wrote:I understand. But if you want to play ball, you have to play by the rules and be a team player. You are now in the BSD/MIT/Apache corner of the OSS community. There is nothing that prevents you from releasing your code under the GPL, you just cannot do it here. :-) Sturla |
| Re: [pystatsmodels] Re: O(n log n) medcouple implementation | Kevin Sheppard | 19/01/15 17:16 | Either a gist or a small repo would be a nice way to distribute it. If you want a challenge, you could see if you can get a numba version which is really nice for end users since it will perform similarly to Cython without any compilation. In fact, it is simple to copy paste numba code from a gist, unlike a Cython which requires a bit of setup to work. |
| Re: O(n log n) medcouple implementation | Sturla Molden | 19/01/15 17:24 | On 19/01/15 07:54, Sturla Molden wrote:That is: NumPy had code that could optionally use FFTW for FFTs. But it never had GPL'd code in the NumPy codebase. Also, NumPy was never linked with FFTW in its official binary installers. Linux distros, OTOH, could choose to have their packaged version of NumPy linked with their packaged version of FFTW. It was not GPL that killed this but lack of developer time for maintenance. But note the difference: Optionally calling third-party GPL'd code [what NumPy did] vs. including GPL'd code in the code-base [what was proposed here]. Sturla |