proposal: package to standardize something like R's []


Tamas Papp

Feb 26, 2014, 9:02:55 AM
to lisp...@googlegroups.com
Hi,

Now that things are picking up, I would like to propose that

1. we agree on a name for a function that is basically equivalent to R's
[],

2. and maybe some related convenience functions,

3. write a trivial package that defines these as generic
functions, and implements them for CL's container types (arrays, lists,
hash tables).

4. after some initial testing, ask that it be included in Quicklisp.

Rationale: I guess we all like to experiment with different approaches,
but having a common operator would promote convergence and
interoperability in the long run.

My suggestions:

I deliberately wrote R's [] there, as it should work for similar
objects, e.g. hash tables. But I would like to keep the S-expression
syntax and make the accessor an ordinary function. Reader
macros and similar clever stuff should be optional and, in the best
case, not needed at all.

So I would propose the name $, eg

($ #(2 3 5) 1) ; => 3

and a recursive version $$, eg

($$ #(#(1 2) 3) 0 1) ; => 2

I am open to other suggestions of course, but I would find [] awkward.
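
A minimal sketch of how the two operators could be defined, to make the
proposal concrete. Only the names $ and $$ come from the message above; the
set of methods, and the choice to build $$ on REDUCE, are assumptions made
for illustration.

```lisp
;;; Hypothetical sketch of the proposed accessors.  Only the names $ and $$
;;; are from the proposal; everything else is an illustrative assumption.

(defgeneric $ (container key)
  (:documentation "Return the element of CONTAINER designated by KEY."))

(defmethod $ ((container vector) (key integer))
  (aref container key))

(defmethod $ ((container list) (key integer))
  (nth key container))

(defmethod $ ((container hash-table) key)
  ;; Drop the second value of GETHASH so all methods return one value.
  (values (gethash key container)))

(defun $$ (container &rest keys)
  "Apply $ repeatedly, descending one level of nesting per key."
  (reduce #'$ keys :initial-value container))

($ #(2 3 5) 1)         ; => 3
($$ #(#(1 2) 3) 0 1)   ; => 2
```

Because $ is an ordinary generic function, other packages could add methods
for their own container types without touching this code.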

Open question: also define generic functions for dimensions? Anything
else?
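
On the open question, one possible shape for a generic dimensions function
is sketched here. The name DIMS and the decision to return a list for every
container are assumptions; nothing in this thread settles either.

```lisp
;;; Hypothetical sketch: DIMS is an illustrative name, not an agreed API.

(defgeneric dims (container)
  (:documentation "Return the dimensions of CONTAINER as a list of integers."))

(defmethod dims ((container array))
  (array-dimensions container))

(defmethod dims ((container list))
  (list (length container)))

(defmethod dims ((container hash-table))
  ;; Treat a hash table as one-dimensional, with one slot per entry.
  (list (hash-table-count container)))

(dims #2A((1 2 3) (4 5 6)))   ; => (2 3)
(dims '(a b c))               ; => (3)
```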

Best,

Tamas

A.J. Rossini

Feb 26, 2014, 11:01:23 AM
to lisp-stat
Dear Tamas -

I have to admit that I'm a little less than enthusiastic as you phrase it. Let me try to explain.

Right now, for me, the main point of dataframes is to be able to access data by variable name and subject ID, and to manipulate data using that information, so that the statistical logic of the code (following an analysis plan) is clear: both the manipulations being made and the subsets or supersets (slicing or building up) of the dataframe should be evident from reading the program.

I'm not interested in introducing non-lispy syntax yet, and find either the current array-building syntax #2A( ) or the list-of-lists structure (with information about "column/variable" or "row/subject" orientation embedded) to suffice. I noticed your use of plists, and while they are quite intriguing, I still need to wrap my head around them.

Related to the API, I like a combination of xarray and cl-slice (familiar packages, eh?) with a "dataframe-building" component (rbind/cbind on steroids, sort of like how your package AFFI provided CL-SLICE on steroids, but at a high complexity cost in terms of grasping what one would do).

In fact, if I could have:

1. column names and row names, with some sanity checking
2. column-typing with optional enforcement
3. elements forming arbitrary Lisp classes (base or object-system or anything in-between derived)
4. access and slicing through names in #1
5. appropriate mappers and iterators and collectors

I'd be in heaven. If the backing store were substitutable in the way that dataframes in Common Lisp Stat (CLS) are, it would be even better, but I'm fine with just using Lisp arrays and not just any old table format there.

Then, if the language you provided (mentioned) was just a set of functions, generics, or macros on top, it would be super awesome.

I'm still a sucker for clear verbose programming, with shortcuts only when painfully clear to all concerned (which is why I love macro expansions).

So that is currently what I'm thinking of (though it isn't documented or implemented yet) in the CLS file src/data/modelframe.lisp (in the LOCAL branch, not yet pushed into the MASTER branch).

How does this compare to your proposal?   To be fair, you tend to implement things that work, and I'm nearly 8 years into this, without too much to show (though 5 went missing due to personal family tragedy, but that's life), so I value your ideas and thoughts, and have always found it worth understanding what is behind them when my first response is to disagree.  

What do you think?

In a sense, you are proposing something sort of like XARRAY, which I like, since it means that the same code can use different backends, plus a CL-SLICE and a CL-DATAFRAME-COMBINATIONS-ON-STEROIDS...

Or...?

best,
-tony





--
You received this message because you are subscribed to the Google Groups "Common Lisp Statistics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to lisp-stat+...@googlegroups.com.
To post to this group, send email to lisp...@googlegroups.com.
Visit this group at http://groups.google.com/group/lisp-stat.
For more options, visit https://groups.google.com/groups/opt_out.



--
best,
-tony

blind...@gmail.com
Muttenz, Switzerland.
"Commit early,commit often, and commit in a repository from which we can easily roll-back your mistakes" (AJR, 4Jan05).

Drink Coffee:  Do stupid things faster with more energy!

Tamas Papp

Feb 26, 2014, 11:31:32 AM
to lisp...@googlegroups.com
Dear Tony,

I have to admit that I don't have a lot of philosophy or long-range
planning behind this. Just that currently I am working in R again, and I
find the generic [] quite handy and expressive for a lot of
situations. I just want something similar in CL that we can agree and
build on.

I think that an equivalent a generic [] is something that we need to
have sooner or later, so I thought we could deal with it now. Otherwise,
all of us will (or have already) introduce incompatible (or even worse,
clashing) syntaxes for the same thing. That is all.

Regarding your observations on our differences in style: I think that I
prefer small-scale, quickly implemented solutions instead of large
designs because I have realized that I lack the ability to plan the
latter, so I would end up putting a lot of effort into code that in the
end does not help me much (this, of course, still happens, despite my
best efforts).

Best,

Tamas

A.J. Rossini

Feb 26, 2014, 12:05:07 PM
to lisp...@googlegroups.com
Sorry, but quickly, before I sign off for the evening: how about taking a first pass at what that API and its generics would do? While I partially understand why you rely on [], it isn't a major tool in what I do, which may explain why I do not get it the way I feel I should. And since I still don't get it completely, a partial realization, just like your other packages, might help me clarify the rationale and eventual goals...

Summary: a bit more detail, please!

Best,
-tony
--
Sent from Gmail Mobile

David Hodge

Feb 26, 2014, 7:57:56 PM
to lisp...@googlegroups.com
If I may interject in the discussion a little.

I absolutely agree with Tamas that the implementations we do should be small and fast, as it seems to me that we are all groping towards defining the capabilities we each need. And until that's clear, any sort of large-scale design is moot, to put it mildly.

For me, most of the stuff I want to do is more time-series oriented, and dataframes as they are currently constructed in CLS are not much help.

Some comments on Tony's list:

1. column names and row names, with some sanity checking
    We have several demonstrations of this: a limited one in the existing CLS, one by Mirko (based on PCL) and one by Tamas (with a very groovy summarisation capability). My vote is to just choose one and improve it to where we need it to be.
2. column-typing with optional enforcement
    Ibid. I don't know if enforcement is the right word; maybe more stringent checking? This also requires a common definition of missing data.
3. elements forming arbitrary Lisp classes (base or object-system or anything in-between derived)
4. access and slicing through names in #1
    Using Tamas's CL-Slice and array ops, this is easy once a candidate DF implementation is agreed.
5. appropriate mappers and iterators and collectors

A problem with CLS as it stands right now is the big ball of mud that it is. I have actually found it easier just to use GSLL or write my own stuff, because I am never quite sure if the CLS stuff will work. This is a result of the long gestation period, I guess, and is understandable, but if meaningful work is going to get done then we do have to think about reducing dependencies and clarifying the API so that both users and developers don't get confused.

On a personal note, I have, temporarily at least, retired from corporate life and have picked up from where I left off some months ago; last year was very dynamic for me, to say the least. I will be relocating from Singapore to New Zealand in April, so I will be on and off line accordingly. I am keen to get going on this though, so if we can get some agreement on what we are doing in the next few weeks, I would be more than happy to work on it.

 


A.J. Rossini

Feb 27, 2014, 11:24:54 AM
to lisp-stat
Dear David -

I'm both somewhat in agreement with you (as I theoretically like to get things done) and somewhat in disagreement (in that I do have a vision for a statistical computing environment that I'd like to make work).

I agree with your comments re: cl-data-table. I've been looking through it, and it's only missing type constraints and a means to compute with column names (which I'm thinking about taking a stab at next week when on vacation). So basically, if I had:

(with-df-row-wise my-df
  (loop-and-collect (univariate-math-expression-that-uses-variable-names)))

(with-df-column-wise my-df
  (m* (transpose-view age) (transpose-view (df-combine weight age)) age))

i.e. age^t [weight:age] age

where [] denotes a make-a-matrix action.

Then I'd be set. On second thought, I will take a stab at this during vacation. XLispStat basically supported vectors for data to analyse, and had no concept of dataframes. It did have a concept of numerical linear algebra, but that was not quite linked.
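
A minimal sketch of the "compute with column names" idea, assuming a data
frame is nothing more than a plist from keyword column names to vectors.
WITH-DF-COLUMNS is a hypothetical name, not one of the macros above, and the
columns are listed explicitly because a macro cannot inspect the columns of
a run-time object when it is expanded.

```lisp
;;; Hypothetical sketch: the plist-of-vectors data frame and the macro name
;;; WITH-DF-COLUMNS are assumptions made for illustration.

(defmacro with-df-columns ((&rest columns) df &body body)
  "Bind each symbol in COLUMNS to the column vector of DF stored under the
keyword of the same name, then evaluate BODY."
  (let ((frame (gensym "DF")))
    `(let ((,frame ,df))
       (let ,(loop for column in columns
                   collect `(,column
                             (getf ,frame
                                   ,(intern (symbol-name column) :keyword))))
         (declare (ignorable ,@columns))
         ,@body))))

(defparameter *my-df*
  (list :age #(30 40 50) :weight #(60 80 75)))

;; Element-wise arithmetic written with variable names: weight / age.
(with-df-columns (age weight) *my-df*
  (map 'vector #'/ weight age))
;; => #(2 2 3/2)
```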




I'd be happy to hear about proposals. I'm still alternating between short-term and long-term objectives on this, coding in different parts of the mud-ball as the whim takes me (and an earlier mail should have pointed out where my whims are right now). A previous email also suggested what to do (in a nod to David and Tamas, it said: "work in ./examples/, not in source, and we'll move it into source when we want it there", i.e. make it infrastructure only when it really is) ... unless you feel like wasting time like I am, but I won't apologise for wasting time on something I am having fun with.

best,
-tony

David Hodge

Feb 27, 2014, 7:22:02 PM
to lisp...@googlegroups.com, lisp-stat
Far be it from me to get in the way of someone's fun!

So, if we can say cl-data-table is the agreed jumping-off point, then great.

The macros you sketch out below should be pretty easy to do, but wiring them into CLS nicely will be a challenge due to the ball of mud. Let's look at that in a few weeks.

I am finishing off my NetCDF library and the loess smoother at the moment, in between moving activities, and that will take at least a week, I think.

Cheers

Sent from my iPad

Tamas Papp

Feb 28, 2014, 7:54:54 AM
to lisp...@googlegroups.com
Sorry, but I don't quite understand what you are asking for.

I think that [] (or, in general, accessors and slicing for array-like
objects) is surely part of every language for data analysis and
numerical work. Do you disagree with this?

Note that I am not saying that every situation should be solved
using []: sometimes there are more elegant and idiomatic ways of doing
this. For example, in R I can either use

data[(data$a > 9) & (data$b < 112), ]

or

subset(data, (a > 9) & (b < 112))

and I agree that the latter is more elegant etc, but [] is still needed
a lot of the time.

I am currently thinking about a nice syntax for the latter in
cl-data-frame. The "problem" is of course that there is no equivalent
of R's inspectable environments in CL (which is a good thing; lexical
closures using environments can get ugly), so something like

(subset data (with-columns (a b) (and (< 9 a) (< b 112))))

would be necessary, where WITH-COLUMNS is a clever macro that creates a
LAMBDA that takes a plist.
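
A minimal sketch of how such a WITH-COLUMNS macro could work. It assumes,
purely for illustration, that the data is a list of row plists keyed by
keywords; this is not a statement about how cl-data-frame stores data, and
the SUBSET function here is a hypothetical stand-in.

```lisp
;;; Hypothetical sketch: the list-of-plists representation and this SUBSET
;;; are assumptions; only the WITH-COLUMNS idea comes from the message.

(defmacro with-columns ((&rest columns) &body body)
  "Expand into a LAMBDA that takes a row plist and binds each symbol in
COLUMNS to the value stored under the keyword of the same name."
  (let ((row (gensym "ROW")))
    `(lambda (,row)
       (let ,(loop for column in columns
                   collect `(,column
                             (getf ,row
                                   ,(intern (symbol-name column) :keyword))))
         (declare (ignorable ,@columns))
         ,@body))))

(defun subset (data predicate)
  "Return the rows of DATA (a list of plists) that satisfy PREDICATE."
  (remove-if-not predicate data))

(defparameter *data*
  '((:a 10 :b 100) (:a 5 :b 50) (:a 12 :b 200)))

(subset *data* (with-columns (a b) (and (< 9 a) (< b 112))))
;; => ((:A 10 :B 100))
```

The column names are resolved when the macro is expanded, so the predicate
body is ordinary lexically scoped code and no run-time environment
inspection is needed.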

Mirko Vukovic

Apr 5, 2014, 8:19:04 AM
to lisp...@googlegroups.com



Tamas,

I am not using R, so my only knowledge of this topic comes from briefly reading the documentation on subsetting. I read the two examples that you mentioned. Could you put up several more that would illustrate what you are trying to achieve in CL and how it is done in R? That would allow me to wrap my head around this problem.

Mirko