Which CL Dataframe would you choose to work with right now?


st...@nunez.org

Jul 12, 2017, 7:53:18 PM
to Common Lisp Statistics

It has been some time since there was any traffic. I hope everyone is doing well.

 

I have an opportunity to do a bit of ‘data science’ work, though I hate that term. If you had to do real work, today, what data frame package would you choose? I looked at Rho; however, its row-major ordering may be an issue. In most of this work the data is already in column-major order, for example in Parquet files, so a row-major data frame might not be the best choice.

 

Tamas Papp has some stuff that looks good, though perhaps lacking in documentation. I like the simplicity of Tamas’ stuff. It certainly seems to be the closest thing to code that is 'in production'.

 

The work here is 'public', in the sense that it is more or less teaching data science for one of the popular online rags. With luck, I hope the traffic from there might help to encourage a community.

 

I see Harvey Stein was attempting to recreate XLISP-STAT a few years back. I think that is an excellent idea. Piece-by-piece, create 'big data statistics'. Harvey, if you are still around, how far did you get?


Is anyone aware of any R- or Spark-compatible data frames in Common Lisp (ideally binary compatible)? Being able to interoperate with those community projects would give us a head start.


Regards,

- Steve

 


Harvey Stein

Jun 5, 2018, 5:31:55 PM
to Common Lisp Statistics
A little late for a reply... :)  I got into a discussion with someone about R, which led me to wondering about the status of common-lisp-stat, which led me to this post, so nonetheless, let me respond!

I started hacking on common-lisp-stat (https://github.com/blindglobe/common-lisp-stat, aka cls), to get it to run my old xlispstat code, but got sidetracked by a number of issues.  It was a while ago, but here's a rough, hazy, off the top of my head recollection.

First off, some of my xlispstat code could be run successfully (sometimes with minor modification) in cls.  This was for code that was mostly numeric, not statistical.  But, it often took a big performance hit.  With a first rate compiler in sbcl, I wanted *better* performance, not worse.

The performance hit was largely due to all of the type testing that gets done in common lisp if you try to use generic functions to add lists of numbers.  So, the vectorized arithmetic operators needed to be sped up.  I didn't think this could be done in general, but it should at least have been possible if one was willing to instead use carefully typed arrays.  So, for example, if the compiler knows you're adding a float to a vector of floats, it should be able to skip most of the type checking and branching and be as fast as (if not faster than) xlispstat.
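As a minimal sketch of the typed-array idea, assuming double-float data: the function name scalar+vector is hypothetical and is not taken from cls or antik. With the argument types declared, a compiler such as sbcl should be able to open-code the float addition instead of dispatching on type for every element.

```lisp
;; Sketch only: SCALAR+VECTOR is a hypothetical name, not part of cls or antik.
(defun scalar+vector (x v)
  "Return a fresh vector holding X plus each element of V."
  (declare (type double-float x)
           (type (simple-array double-float (*)) v)
           (optimize (speed 3) (safety 1)))
  ;; The result is also a specialized array, so stores need no boxing.
  (let ((result (make-array (length v) :element-type 'double-float)))
    (dotimes (i (length v) result)
      (setf (aref result i) (+ x (aref v i))))))

;; Usage: the input must really be a (simple-array double-float (*)).
(scalar+vector 1d0 (make-array 3 :element-type 'double-float
                                 :initial-contents '(1d0 2d0 3d0)))
```

The cost of this approach is that callers have to build specialized arrays up front; a plain list or a general vector does not satisfy the declaration.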

Doing this was complicated by the fact that there were 2 implementations of vectorization floating around in xlispstat.  There was a simple one, which I didn't think was up to the task, and a complicated one (from antik), which was potentially able to do this.  So I started tweaking antik, and was in fact able to speed up this case by a factor of 10.  But not consistently.  For example, adding the two typed vectors was fast, but map in sbcl didn't take the types into account, so it wasn't faster.  Fixing this would require hacking type inference in sbcl, which I started getting into, but it was complicated.

Moreover, the testing and tuning was hampered by some GC bugs in SBCL (https://bugs.launchpad.net/sbcl/+bug/1446962).  It was sometimes crashing instead of collecting old garbage when lots of large arrays were used.

On top of this, there were other points in cls that were complicated.  There was some of the original xlispstat statistics code there, but there was also work in progress on switching over to gsl.  I started trying to fix/resurrect the xlispstat stat code before I realized that it was being retired.  Then I started implementing xlispstat compatible stats code using gsl, but sometimes gsl used different algorithms which gave back different results.  For example, consider:

> (help 'chol-decomp)
loading in help file information - this will take a minute ...done
CHOL-DECOMP                                                     [function-doc]
Args: (a)
Modified Cholesky decomposition. A should be a square, symmetric matrix.
Computes lower triangular matrix L such that L L^T = A + D where D is a diagonal
matrix. If A is strictly positive definite D will be zero. Otherwise D is as
small as possible to make A + D numerically strictly positive definite. Returns
a list (L (max D)).

The Cholesky decomposition in gsl didn't work the same way.

Then there were the data frames.  xlispstat didn't have them & work was starting on adding such things to cls.

Then there was package symbol handling.  Importing a vectorized version of *, for example, didn't work exactly as I had expected.  The issue is that the symbol * is used in declarations in common lisp.  In the declarations, the symbol * has to come from the common-lisp package.  So, overriding * with antik:* breaks the common form of declarations that people use.
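A minimal sketch of the symbol clash, assuming a hypothetical package vec-demo that shadows * the way a package importing a vectorized multiply would. The multiply defined here is only a stand-in and is not antik's definition.

```lisp
;; Hypothetical package standing in for one that overrides * with a
;; vectorized version.
(defpackage #:vec-demo
  (:use #:common-lisp)
  (:shadow #:*))

(in-package #:vec-demo)

;; Stand-in for a vectorized multiply (not antik's implementation).
(defun * (a b)
  (map 'vector #'cl:* a b))

;; Inside this package, (*) in a type specifier reads as (vec-demo::*),
;; which is not the wildcard symbol from the COMMON-LISP package, so the
;; usual form is expected to be rejected:
;;   (declare (type (simple-array double-float (*)) v))

;; Qualifying the wildcard as CL:* restores the standard meaning.
(defun sum-vector (v)
  (declare (type (simple-array double-float (cl:*)) v))
  (reduce #'+ v))
```

The workaround is mechanical, but it means existing code with ordinary declarations has to be edited before it can live in such a package.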

I also hit a transitory problem with quicklisp - some temporary package incompatibility which interrupted work.

The bottom line was that a) it was turning into more work than I had time for at the time, b) common-lisp-stat was going in a somewhat different direction (with the data frames rather than straight xlispstat compatibility), and c) other packages had to be modified (like antik) and thus changes had to be contributed to them, accepted by the authors, etc.  So I stopped working on it.

I still think it'd be possible to get a reasonably compatible version of xlispstat in common lisp, but it'll take some compromises & it won't be 100% compatible.  And it'll take more time than I have to spend on it.

-- Harvey