ann: yet another data-frame


Mirko Vukovic

Jan 6, 2013, 9:19:17 PM
to lisp...@googlegroups.com, antik...@common-lisp.net
I just pushed my version of data-frames to my GitHub repository.  You can see a heavily annotated set of examples (with PNG plots) at http://github.com/mirkov/data-table/tree/master/user/example1.  It is built on top of Antik and GSLL.

As this code might duplicate work already performed by David Hodge on CL-stat, a couple of comments are in order:

1) I developed this code based on my own needs for some simple data analysis, and started independently of David
2) It was great fun, and I learned some neat programming techniques
3) I *do* plan to move to CL-stat once the crunch related to my current project subsides
4) At that time, I will gladly move to David's code and port to it whatever parts of mine are needed

The annotated examples show:
- reading of a file into a table
- dealing with missing values
- querying and selecting parts of the table
- filling of missing values by interpolation
- linear and non-linear fitting

The first three use a syntax modeled on chapter 27 of PCL (Practical Common Lisp).
The latter two use GSLL.  The examples show a simplified interface for performing interpolation and fits.
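
As a rough illustration of the "filling of missing values by interpolation" step, here is a minimal sketch in plain Common Lisp. It does not use GSLL or the data-table interface; the function name FILL-MISSING-LINEAR and the convention of marking missing entries with NIL are assumptions made for this sketch only.

```lisp
;; Hypothetical helper, not part of data-table or GSLL.
;; XS holds the abscissae; YS holds the values, with NIL for missing entries.
(defun fill-missing-linear (xs ys)
  "Return a copy of YS in which interior NIL entries are replaced by
linear interpolation between the nearest known neighbours.
Leading and trailing NILs are left untouched."
  (let ((out (copy-seq ys))
        (prev nil))                     ; index of the last known value
    (dotimes (i (length ys) out)
      (when (aref ys i)
        ;; A gap lies between PREV and I: interpolate across it.
        (when (and prev (> (- i prev) 1))
          (let ((x0 (aref xs prev)) (y0 (aref ys prev))
                (x1 (aref xs i))    (y1 (aref ys i)))
            (loop for k from (1+ prev) below i
                  do (setf (aref out k)
                           (+ y0 (* (- y1 y0)
                                    (/ (- (aref xs k) x0)
                                       (- x1 x0))))))))
        (setf prev i)))))

;; Two missing values between the years 1800 and 1950:
(fill-missing-linear #(1800 1850 1900 1950) #(10 nil nil 40))
;; => #(10 20 30 40)
```

A real implementation would more likely hand the known points to an interpolation routine (for example a spline) instead of hard-coding the linear case.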

Finally, to close with a warning: this code is of fleeting nature, as I do want it to become part of CL-stat.  I am posting it here in case David and others see something useful in it to incorporate in their current work.

Mirko


David Hodge

Jan 6, 2013, 11:03:48 PM
to lisp...@googlegroups.com
On 7/01/13 10:19 AM, Mirko Vukovic wrote:
> I just pushed to my github repository my version of data-frames. You
> can see a very annotated set of examples (with png plots) on
> http://github.com/mirkov/data-table/tree/master/user/example1. It is
> built on top antik & gsll.

This is a very timely email. I have been experimenting with the existing
lisp stat dataframe approach and have decided that it needs lots of
improvement! To that end I have also taken a branch of the PCL chapter
27 code, so it seems great minds think alike.

But wait! Tamas has also released an alpha version of another dataframe
package!

So many to choose from!

For me, the impetus to look at other approaches for dataframes came when
I actually tried to do some work with the existing CL-Stat dataframe.
Currently, it just doesn't add that much value, I am afraid. I ended up
having to write a bunch of helpers to do some of the things that Mirko
has listed below. I have also done things like linear fitting using
GSLL, but from what I see Mirko's solution is nicer and more lispy.

What I think we have to do is look more closely at the PCL derived
solutions and evaluate Tamas's approach and make a call as to the way
forward.

Give me a couple of days and I will share my thoughts.

BTW, Mirko, is your gnuplot library working OK? For the moment I am
using cgn, which is very simple. Your library seems more complete, but
I was not sure whether it was ready for others to use.

Cheers



>
> As this code might duplicate work already performed by David Hodge on
> CL-stat, a couple of comments are in order:
>
> 1) I developed this code based on my own needs for some simple data
> analysis, and started independently of David
> 2) It was great fun, and I learned some neat programming techniques
> 3) I *do* plan to move to CL-stat once the crunch related to my
> current project subsides
> 4) At that time, I will gladly move on to David's code, and as
> necessary port some of my stuff to his code (if required)
>
> The annotated examples show:
> - reading of a file into a table
> - dealing with missing values
> - querying and selecting parts of the table
> - filling of missing values by interpolation
> - linear and non-linear fitting
>
> The first three use a syntax of PCL's chapter 27
> The latter two use GSLL. The examples show the simplified interface
> for performing interpolation and fits.
>
> Finally, to close with a warning: this code is of fleeting nature, as
> I do want it to become part of CL-stat. I am posting it here in case
> David and others see something useful in it to incorporate in their
> current work.
>
> Mirko

Tamas Papp

Jan 7, 2013, 10:29:25 AM
to lisp...@googlegroups.com

On Mon, Jan 07 2013, David Hodge <david...@gmail.com> wrote:

> On 7/01/13 10:19 AM, Mirko Vukovic wrote:
>> I just pushed to my github repository my version of data-frames. You
>> can see a very annotated set of examples (with png plots) on
>> http://github.com/mirkov/data-table/tree/master/user/example1. It is
>> built on top antik & gsll.
>
> THis is a very timely email. I have been experimenting with the existing
> lisp stat dataframe approach and have decided that it needs lots of
> improvement! To that end i have also taken a branch of the PCL chapter
> 27 , so it seems great minds think alike.
>
> But Wait! Tamas has also released an alpha version of another dataframe
> package too!

My library is very similar to R's data-frame, basically it is a list of
vectors with some syntactic sugar and optimizations, that plays well
with the recently released cl-slice and the yet unannounced (but
released) data-omnivore (that reads from CSV and supports all kinds of
wacky decimal number formats).

A short transcript:

CL-USER> (use-package '(#:cl-data-frame #:cl-slice))
T
CL-USER> (defparameter *d* (data-frame :gender #(male male female) :age #(30 31 32)))
*D*
CL-USER> (slice *d* t :gender)
#(MALE MALE FEMALE)
CL-USER> (slice *d* #(0 1) :age)
#(30 31)

> So many to choose from!

Indeed :-) My constraint is that I need something that works, now. So
all my libraries are simple but they are used in my projects.

> What I think we have to do is look more closely at the PCL derived
> solutions and evaluate Tamas's approach and make a call as to the way
> forward.
>
> Give me a couple of days and I will share my thoughts.

Your comments would be very welcome --- but maybe you want to look at
cl-slice first as that provides the key functionality, array slices no
other language has dreamed of (although I am saying this myself :-P).
At this stage the code for all 3 libraries is very simple and short, so
it should be readable. Examples are in the unit tests.

Best,

Tamas

Mirko Vukovic

Jan 7, 2013, 1:03:45 PM
to lisp...@googlegroups.com


On Monday, January 7, 2013 10:29:25 AM UTC-5, Tamas Papp wrote:

> On Mon, Jan 07 2013, David Hodge <david...@gmail.com> wrote:
>
>> On 7/01/13 10:19 AM, Mirko Vukovic wrote:
>>> I just pushed to my github repository my version of data-frames.  You
>>> can see a very annotated set of examples (with png plots) on
>>> http://github.com/mirkov/data-table/tree/master/user/example1.  It is
>>> built on top antik & gsll.
>>
>> THis is a very timely email. I have been experimenting with the existing
>> lisp stat dataframe approach and have decided that it needs lots of
>> improvement! To that end i have also taken a branch of the PCL chapter
>> 27 , so it seems great minds think alike.
>>
>> But Wait! Tamas has also released an alpha version of another dataframe
>> package too!
>
> My library is very similar to R's data-frame, basically it is a list of
> vectors with some syntactic sugar and optimizations, that plays well
> with the recently released cl-slice and the yet unannounced (but
> released) data-omnivore (that reads from CSV and supports all kinds of
> wacky decimal number formats).

I use a vector of vectors.  I see room for unification here.
 

> A short transcript:
>
> CL-USER> (use-package '(#:cl-data-frame #:cl-slice))
> T
> CL-USER> (defparameter *d* (data-frame :gender #(male male female) :age #(30 31 32)))
> *D*
> CL-USER> (slice *d* t :gender)
> #(MALE MALE FEMALE)
> CL-USER> (slice *d* #(0 1) :age)
> #(30 31)


Can you do a slice across columns, returning say #(male 31)?

One issue that I had to deal with in my table queries: Seibel's code passes a whole row to the query (his data structure was a vector of lists, each list being a record), whereas I pass the row index.  I then also need access to the table in order to reach the row elements, via (aref (aref table column-index) row-index).

What would be nice to have (and maybe your code provides for it already) is something along the following lines of pseudo-code

CL> (setf data-table (make-nested-vector 5 10))
CL> (setf (aref! data-table i j) x)  ;; and so on, to fill up the data table
CL> (aref! data-table 3 4)  ;; to recover data

CL> (setf row (row-slice data-table row-index))  ;; define a row-slice object with its accessor
                                                 ;; now magic happens in the background
CL> (aref! row i)  ;; we can access individual row elements

aref! knows how to deal with vectors of vectors and row-slices.  The row-slice actually contains a pointer to data-table.
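
One possible reading of this pseudo-code, sketched in plain Common Lisp. The names MAKE-NESTED-VECTOR, ROW-SLICE and AREF! come from the pseudo-code; the definitions, the (row, column) argument order, and the choice of 5 columns by 10 rows are assumptions made for illustration.

```lisp
;; Sketch only: one way the pseudo-code could be realised.
(defun make-nested-vector (n-columns n-rows)
  "A vector of N-COLUMNS column vectors, each of length N-ROWS."
  (let ((table (make-array n-columns)))
    (dotimes (j n-columns table)
      (setf (aref table j) (make-array n-rows :initial-element nil)))))

(defstruct (row-slice (:constructor row-slice (table index)))
  table    ; the table itself (shared, not copied)
  index)   ; the row this slice refers to

(defgeneric aref! (object &rest subscripts))

(defmethod aref! ((table vector) &rest subscripts)
  ;; Assumed argument order: row first, then column.
  (destructuring-bind (row column) subscripts
    (aref (aref table column) row)))

(defmethod aref! ((row row-slice) &rest subscripts)
  ;; A row slice needs only the column; it looks the row up in its table.
  (destructuring-bind (column) subscripts
    (aref! (row-slice-table row) (row-slice-index row) column)))

(defgeneric (setf aref!) (value object &rest subscripts))

(defmethod (setf aref!) (value (table vector) &rest subscripts)
  (destructuring-bind (row column) subscripts
    (setf (aref (aref table column) row) value)))

(defparameter *table* (make-nested-vector 5 10))
(setf (aref! *table* 3 4) 42)
(defparameter *row* (row-slice *table* 3))
(aref! *row* 4)   ; reads through to *TABLE*, so this is 42
```

Because the row slice holds a reference to the table and not a copy, later changes to the table are visible through the slice.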


BTW, did you notice the example where I select population data from the year 1800 until the population reaches 10^8?  It uses

(defparameter *table-1800->1e8*
  (select *raw-table*
          :where (matching-rows *raw-table*
                                (list 'year 1800 #'>=)
                                (list 'population 100000000 #'<=))))
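
For readers without the repository at hand, here is a guess at how a MATCHING-ROWS-style query could work, in plain Common Lisp. This is a hypothetical re-creation, not the data-table code: the table is modelled as an alist of column name and column vector, and each test is read as (funcall predicate cell value).

```lisp
;; Hypothetical re-creation; the real data-table functions may differ.
;; A table is an alist of (column-name . column-vector).
(defun matching-rows (table &rest tests)
  "Each test is a list (COLUMN VALUE PREDICATE).  A row matches when
(funcall PREDICATE cell VALUE) is true for every test.
Returns the list of matching row indices."
  (let ((n-rows (length (cdr (first table)))))
    (loop for row below n-rows
          when (every (lambda (test)
                        (destructuring-bind (column value predicate) test
                          (funcall predicate
                                   (aref (cdr (assoc column table)) row)
                                   value)))
                      tests)
            collect row)))

(defparameter *toy-table*
  (list (cons 'year       #(1700 1800 1900 2000))
        (cons 'population #(10 20 50 200))))

;; Rows with year >= 1800 and population <= 100:
(matching-rows *toy-table*
               (list 'year 1800 #'>=)
               (list 'population 100 #'<=))
;; => (1 2)
```

Passing row indices around, as described earlier in this message, is what lets the tests look up cells in any column of the table.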


 
>> So many to choose from!
>
> Indeed :-)  My constraint is that I need something that works, now.  So
> all my libraries are simple but they are used in my projects.
>
>> What I think we have to do is look more closely at the PCL derived
>> solutions and evaluate Tamas's approach and make a call as to the way
>> forward.
>>
>> Give me a couple of days and I will share my thoughts.


Yep
 
> Your comments would be very welcome --- but maybe you want to look at
> cl-slice first as that provides the key functionality, array slices no
> other language has dreamed of (although I am saying this myself :-P).
> At this stage the code for all 3 libraries is very simple and short, so
> it should be readable.  Examples are in the unit tests.

I am in the same boat as you.  I will keep using my library for the current project, but I do want to participate in this discussion to
see how to proceed forward.
 


> Best,
>
> Tamas

Mirko

Mirko Vukovic

Jan 7, 2013, 1:21:17 PM
to lisp...@googlegroups.com


On Sunday, January 6, 2013 11:03:48 PM UTC-5, David Hodge wrote:
> On 7/01/13 10:19 AM, Mirko Vukovic wrote:
>> I just pushed to my github repository my version of data-frames.  You
>> can see a very annotated set of examples (with png plots) on
>> http://github.com/mirkov/data-table/tree/master/user/example1.  It is
>> built on top antik & gsll.
>
> THis is a very timely email. I have been experimenting with the existing
> lisp stat dataframe approach and have decided that it needs lots of
> improvement! To that end i have also taken a branch of the PCL chapter
> 27 , so it seems great minds think alike.
>
> But Wait! Tamas has also released an alpha version of another dataframe
> package too!
>
> So many to choose from!
>
> For me, the impetus to look at other approaches for dataframes came when
> I actually tried to do some work with the existing CL-Stat dataframe.
> Currently, they just don't add that much value I am afraid. I ended up
> having to write a bunch of helpers to do some of the things that mirko
> has listed below. I also have done things like linear fitting etc using
> GSLL, but from what I see Mirko's solution is nicer and more lispy
>
> What I think we have to do is look more closely at the PCL derived
> solutions and evaluate Tamas's approach and make a call as to the way
> forward.
>
> Give me a couple of days and I will share my thoughts.

As I see it, we all now have code that satisfies our needs.  This removes the pressure to produce something quickly and gives us time to sit back and use our current experience to design something better.  I agree with Tamas' approach of small, well-defined and well-designed libraries that each do one thing well.  We can step back and work out which small libraries we need; the data-frame can then be built on top of them.

 

> BTW, Mirko, is your gnuplot library working ok? for the moment I was
> using cgn - which is very simple , your library seems more ocmplete, but
> was not sure if it was ready for others to use?


It is rather complete, but it depends on some of my unreleased code.  And it is completely undocumented.  And it shows off some really poor coding skills :-)

I think it is better that you stick to what you are currently using, and after dataframes we move on to GoG

Mirko

Tamas Papp

Jan 8, 2013, 3:48:58 AM
to lisp...@googlegroups.com

On Mon, Jan 07 2013, Mirko Vukovic <mirko....@gmail.com> wrote:

> On Monday, January 7, 2013 10:29:25 AM UTC-5, Tamas Papp wrote:
>
>> My library is very similar to R's data-frame, basically it is a list of
>> vectors with some syntactic sugar and optimizations, that plays well
>> with the recently released cl-slice and the yet unannounced (but
>> released) data-omnivore (that reads from CSV and supports all kinds of
>> wacky decimal number formats).
>>
>
> I use a vector of vectors. I see room for unification here.

Possibly. I think that your library is a bit more complex than I need
at this point. Currently mine is less than 350 LOC. There are three
reasons for this:

1. I let CL-SLICE do the heavy lifting (for slices),

2. Columns take care of themselves, you only need to define three
methods for a column implementation (length, summary, slice) and that's
it. Currently only vectors can be columns, but it would be trivial to
extend to other kinds, eg arrays as a column of subarrays, etc.

3. Keeping everything trivial. No super-generic API, no schemas, no
bells and whistles. It is useful to think about my data frames as not
much more than overglorified plists/alists, with a bit of sugar (or
preferably xylitol). Or a saner re-implementation of R's data frame; I
am not aiming for much more than that (apart from the clever SLICE).
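
A minimal sketch of what such a three-method column protocol could look like, in plain Common Lisp. The generic function names COLUMN-LENGTH, COLUMN-SUMMARY and COLUMN-SLICE, and the CONSTANT-COLUMN example, are invented for this illustration and are not necessarily the names cl-data-frame uses.

```lisp
;; Hypothetical names; the actual generic functions in the library may differ.
(defgeneric column-length (column))
(defgeneric column-summary (column))
(defgeneric column-slice (column rows))

;; Plain vectors as columns.
(defmethod column-length ((column vector))
  (length column))

(defmethod column-summary ((column vector))
  (format nil "vector of ~D element~:P" (length column)))

(defmethod column-slice ((column vector) rows)
  (map 'vector (lambda (i) (aref column i)) rows))

;; A second kind of column: one value repeated N times.
;; It only has to implement the same three methods.
(defstruct constant-column value n)

(defmethod column-length ((column constant-column))
  (constant-column-n column))

(defmethod column-summary ((column constant-column))
  (format nil "constant ~S repeated ~D time~:P"
          (constant-column-value column)
          (constant-column-n column)))

(defmethod column-slice ((column constant-column) rows)
  (make-constant-column :value (constant-column-value column)
                        :n (length rows)))

(column-slice #(10 20 30) #(0 2))   ; rows 0 and 2 of a vector column
(column-length (column-slice (make-constant-column :value 1 :n 10)
                             #(0 2 4)))
```

The data frame itself never needs to know which kind of column it holds, which is one way to keep the core small.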

>> CL-USER> (use-package '(#:cl-data-frame #:cl-slice))
>> T
>> CL-USER> (defparameter *d* (data-frame :gender #(male male female) :age
>> #(30 31 32)))
>> *D*
>> CL-USER> (slice *d* t :gender)
>> #(MALE MALE FEMALE)
>> CL-USER> (slice *d* #(0 1) :age)
>> #(30 31)
>>
>>
> Can you do a slice across columns, returning say #(male 31)?

(SLICE df row-index t) would do that, but that's a corner case I haven't
handled yet because I haven't decided what it should return, as it
drops a dimension. Probably it will return a plist, (:gender male :age
30), or an alist, ((:gender . male) (:age . 30)).

> One issue that I had to deal with in my table queries is that while Seibel's
> code passes a whole row (his data structure was a vector of lists, each
> list being a record), I pass the row index. Then I would need to have
> access to the table, in order to get access to row elements (aref (aref
> table column-index) row-index)
>
> What would be nice to have (and maybe your code provides for it already) is
> something along the following lines of pseudo-code
>
> CL> (setf data-table (make-nested-vector 5 10))
> CL> (setf (aref! data-table i j) x
> and add so on to fill up the data table...)
> CL> (aref! data-table 3 4) ;; to recover data
>
> CL> (setf row (row-slice data-table row-index)) ;; define a row-slice
> object with its accessor
> ;;; now magic happens in the background
>
> CL> (aref! row i) ;; we can access individual row elements
>
> aref! knows how to deal with vectors of vectors and row-slices. The
> row-slice actually contains a pointer to data-table.

My data frames can handle the first part, with something like

(setf *df* (data-frame :a column :b column ....))
; different creation syntax, need to name columns
(setf (slice *df* i j) x) ; set a single element -- NOT IMPLEMENTED YET
(slice *df* i j) ; get it back
(setf row (slice *df* row-index t)) ; T selects all columns

But then I haven't decided what ROW is --- should it be a plist, an
alist, or a separate kind of structure? The former two are more
transparent, but would slice work on them seamlessly? (It is not
trivial to distinguish lists, alists and plists).
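
To make the two candidate representations concrete, here is a small sketch in plain Common Lisp. The data frame is modelled as a plist of column vectors, which is an assumption for this example and not the cl-data-frame representation; ROW-AS-PLIST and ROW-AS-ALIST are made-up names.

```lisp
;; Toy data frame: a plist of column-name / vector pairs (an assumption).
(defun row-as-plist (df row)
  "Return row ROW as a plist of column name and cell value."
  (loop for (name column) on df by #'cddr
        append (list name (aref column row))))

(defun row-as-alist (df row)
  "Return row ROW as an alist of (column-name . cell-value)."
  (loop for (name column) on df by #'cddr
        collect (cons name (aref column row))))

(defparameter *toy-df* (list :gender #(male male female)
                             :age #(30 31 32)))

(row-as-plist *toy-df* 1)   ; => (:GENDER MALE :AGE 31)
(row-as-alist *toy-df* 1)   ; => ((:GENDER . MALE) (:AGE . 31))
```

Both results are ordinary lists, so a generic slicing function dispatching on type cannot tell a plist row from any other even-length list. That ambiguity is the argument for a separate row structure.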

> BTW, did you notice one of my examples where I select population data from
> year 1650 until population reaches 10^8 using
>
> (defparameter *table-1800->1e8*
> (select *raw-table*
> :where (matching-rows *raw-table*
> (list 'year 1800 #'>=)
> (list 'population 100000000 #'<=)))

In my data-frame implementation, you select using bit-vectors and
slice. So you would do it with

(slice *data-table*
       (select-rows *data-table* '(year population)
                    (lambda (year population)
                      (and (<= 1800 year) (<= population 100000000)))))

There is some syntactic sugar for binding values for each column to
variables, macros that do that are named with a gerund (-ing):

(slice *data-table*
       (selecting-rows (*data-table*
                        (year year)
                        (population population))
         (and (<= 1800 year) (<= population 100000000))))
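
The bit-vector idea can be tried out without either library. The following is a toy sketch in plain Common Lisp; TOY-SELECT-ROWS and TOY-MASK-SLICE are made-up names, columns are passed directly as vectors, and the population threshold is scaled down to fit the toy data.

```lisp
;; Toy versions; not the cl-data-frame / cl-slice functions.
(defun toy-select-rows (predicate &rest columns)
  "Return a bit vector with 1 wherever PREDICATE holds for the row.
PREDICATE receives one cell from each of COLUMNS."
  (apply #'map 'bit-vector
         (lambda (&rest cells) (if (apply predicate cells) 1 0))
         columns))

(defun toy-mask-slice (column mask)
  "Keep the elements of COLUMN whose MASK bit is 1."
  (coerce (loop for cell across column
                for bit across mask
                when (= bit 1) collect cell)
          'vector))

(defparameter *year*       #(1700 1800 1900 2000))
(defparameter *population* #(10 20 50 200))

(defparameter *mask*
  (toy-select-rows (lambda (year population)
                     (and (<= 1800 year) (<= population 100)))
                   *year* *population*))
;; *MASK* is now #*0110

(toy-mask-slice *year* *mask*)   ; => #(1800 1900)
```

One advantage of a mask over a list of indices is that masks from several conditions can be combined with BIT-AND or BIT-IOR before slicing.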

Best,

Tamas