storage modality: dataframe-like and matrix-like


A.J. Rossini

Nov 3, 2012, 7:23:30 PM
to lisp...@googlegroups.com
Ok gang, here's one of the issues we have to get people to think about, related to infrastructure.  It's hidden in a post I made, but needs a top-level topic for it.  I wrote:

> Dataframe-like objects support general data, and given a model and
> assumptions, map to a matrix-like object for computation.  One of the
> deep secrets that you need to know about going between data and data
> analysis is that there is a mapping, sometimes the identity mapping,
> that goes on between the dataframe-like and the matrix-like that will
> be used for computing quantities, and then the results are attributed
> back as metadata onto the dataframe-like.
>
> We've got the general infrastructure there, lisp-matrix introduces the
> matrix-like object which uses different storage types depending on the
> system, with the criteria that the values of the cells in the matrix
> are numbers that can be computed with (and that there is a common
> class that holds all the values in the matrix-like).  dataframe-like
> is similar, but with column-enforced typing (planned future feature
> :-).
>
> So repeat after me:
>
> Dataframe-like + model/assumptions -> matrix-like which can be
> computed with -> metadata and derived data based  on model/assumptions
> for dataframe-like


and here's the basic issue.   We want to work with data (using CSV format, variable names as strings in first row):

"Height in cm as integer", "weight in kilo as double float",  "Gender as string", "Favorite color as string"
160 , 50.4d0, "Male",  "Red"
143 , 45.7d0, "Female", "White"
132 , 30.0d0, "Female", "Blue"

and this would be in the dataframe-like.  However, this would probably map to a matrix-like such as:

"Height in cm as integer", "weight in kilo as double float",  "Gender as string Male", "... female", "Favorite color as string white", "... red", ".... blue"
160 , 50.4d0, 1,  0,  0, 1, 0
143 , 45.7d0, 0,  1,  1, 0, 0
132 , 30.0d0, 0,  1,  0, 0, 1

and the column expansion of the factors (gender, favorite color) is done under particular assumptions (yes, it might seem natural, but there are a few other coding approaches, and this is just the naive one, not the best one, actually!).
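
To make the mapping concrete, here is a minimal sketch in Python (not Common Lisp, and not CLS code) of the naive indicator expansion shown above, next to a reference-level coding that drops the first level of each factor. The names `indicator_columns` and `design_matrix`, and the idea of declaring factors in a set, are assumptions made up for illustration. Levels come out in order of first appearance, so the colour columns are ordered Red, White, Blue rather than as in the table above.

```python
# Hypothetical sketch: expand string-valued factor columns into 0/1 indicator
# columns, leaving numeric columns untouched. All names are illustrative only.

rows = [
    (160, 50.4, "Male", "Red"),
    (143, 45.7, "Female", "White"),
    (132, 30.0, "Female", "Blue"),
]
names = ["height", "weight", "gender", "color"]
factors = {"gender", "color"}  # assumption: the analyst declares the factors


def indicator_columns(values, drop_first=False):
    """Return (levels, columns): one 0/1 column per level, in order of first appearance."""
    levels = list(dict.fromkeys(values))
    if drop_first:
        levels = levels[1:]  # the first level becomes the reference level
    return levels, [[1 if v == lev else 0 for v in values] for lev in levels]


def design_matrix(rows, names, factors, drop_first=False):
    """Map a dataframe-like (rows plus names) to (column labels, numeric rows)."""
    labels, columns = [], []
    for j, name in enumerate(names):
        values = [r[j] for r in rows]
        if name in factors:
            levels, cols = indicator_columns(values, drop_first)
            labels += [f"{name}={lev}" for lev in levels]
            columns += cols
        else:
            labels.append(name)
            columns.append(values)
    # transpose the columns back into rows
    return labels, [list(r) for r in zip(*columns)]


naive_labels, naive = design_matrix(rows, names, factors)
treat_labels, treat = design_matrix(rows, names, factors, drop_first=True)
```

The same dataframe-like yields two different matrices here, which is the point: the expansion depends on the coding assumption, not only on the data.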

Hope this helps...

best,
-tony



Mirko Vukovic

Nov 3, 2012, 10:23:28 PM
to lisp...@googlegroups.com

I am a bit confused about the status of data-frames.  Has any code been written yet?

I go back again to Chapter 27 of Practical Common Lisp, where Peter Seibel discusses an MP3 database.  His code has nice features for querying, modifying, and selecting parts of the database.

How is that database format related to data-frames?  (Sorry, I'm a physicist, so I could be missing some of the finer points)

The reason I ask is that I implemented a part of the PCL code for a data format like the one Tony posted above.  If this database format is relevant to the data-frame requirements, I can clean up the code quickly, post it on github, and it could serve as a starting point for data-frames.

Internally, data is stored in a vector of extendable vectors.  Top level vector stores columns and each sub-vector stores the data for one column.  The data structure contains a schema with the table meta-data (column names, types, etc).
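
A minimal sketch of that layout, written in Python rather than Lisp: a top-level list of growable per-column lists, plus a schema holding the column names and types. The class `ColumnTable` and its methods are hypothetical and are not taken from the PCL code or from the implementation described above.

```python
# Hypothetical sketch of a column store with a schema; none of these names
# come from Practical Common Lisp or from the code mentioned in the post.

class ColumnTable:
    def __init__(self, schema):
        # schema: list of (column name, type) pairs, i.e. the table metadata
        self.schema = list(schema)
        # the top-level list holds one growable list per column
        self.columns = [[] for _ in self.schema]

    def insert(self, *values):
        """Append one row, checking every value against its column type first."""
        if len(values) != len(self.schema):
            raise ValueError("row length does not match schema")
        for (name, typ), value in zip(self.schema, values):
            if not isinstance(value, typ):
                raise TypeError(f"column {name!r} expects {typ.__name__}")
        for column, value in zip(self.columns, values):
            column.append(value)

    def column(self, name):
        """Return the storage list for one column, looked up by name."""
        for (col_name, _), column in zip(self.schema, self.columns):
            if col_name == name:
                return column
        raise KeyError(name)

    def select(self, predicate):
        """Return the rows (as dicts) for which predicate(row) is true."""
        keys = [name for name, _ in self.schema]
        rows = (dict(zip(keys, vals)) for vals in zip(*self.columns))
        return [row for row in rows if predicate(row)]


table = ColumnTable([("height", int), ("weight", float), ("gender", str), ("color", str)])
table.insert(160, 50.4, "Male", "Red")
table.insert(143, 45.7, "Female", "White")
table.insert(132, 30.0, "Female", "Blue")
females = table.select(lambda row: row["gender"] == "Female")
```

Validating the whole row before appending keeps the columns the same length even when an insert is rejected.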

Mirko


David Hodge

Nov 3, 2012, 11:09:26 PM
to lisp...@googlegroups.com
Hi Mirko,

Yes.

Quite a lot actually, but it needs to be cleaned up, and functional things like summarise etc. still need to be written.

Look in src/data/dataframe.lisp

I would be interested in seeing your code though.

And I am even more interested in understanding your gnuplot grammar of graphics - is that ready for external use, or more a private project?

Cheers



A.J. Rossini

Nov 4, 2012, 2:26:00 AM
to lisp...@googlegroups.com
David is right - it's a work in progress.  They inherit from the lisp-matrix classes as well, since while they (dataframe-like and matrix-like) are different families, I'd like to leverage the usual lack-of-distinction between them to facilitate faster learning of the system.

This (confusion between dataframes and matrices), IMHO, is what makes the difference between a data analysis/numerical analysis/symbolic analysis system like MATLAB, IDL, Macsyma, etc, and a statistical analysis system like R, S-PLUS, SAS, etc.  While I'd like to achieve both, and think that we eventually could (probably about the time I retire), I want to enforce that the dataframe needs a model and assumptions (about data, about model, about what the analyst wants to do) in order to provide an appropriate design or model-matrix for computing on.  Even something as simple as a 5-num summary could benefit, if you can code up and recognize the assumption that "all those estimates of the voting population are really noisy and coming from different sources on the political spectrum, and are by themselves not quite correct for the overall question of who will win" -- and have the system recommend alternative approaches based on the ontological specification of the model and data/model assumptions.

Again, the above outlines a long-term plan, and we still have to figure out what can be done by next week -- the ontology alone (at least one solution that I have in my head, there are probably simpler ones...) requires 2-3 PhD dissertations (all of which I'd love to sponsor :-).

best,
-tony

Tamas Papp

Nov 4, 2012, 4:38:29 AM
to lisp...@googlegroups.com
I don't think that there is a unique or "best" mapping from a data frame
to a matrix, as the same data could be mapped in various ways (and I am
not thinking about ordering/permutations). Also, sometimes it does not
make sense to map into a single huge regression matrix (eg
multilevel/hierarchical models, also numerical issues with large design
matrices).

Anyhow, I think that data frames are an OK way to manage data, and it
would be good to have a common standard for that, but I would prefer if
the data frame library was separate and the matrix/regression library
could just build on that.

Best,

Tamas

A.J. Rossini

Nov 4, 2012, 4:51:01 AM
to lisp...@googlegroups.com
I agree about "no unique or best" -- and more importantly, often you
do have a 1-to-many mapping between a dataframe and the resulting
matrices used for data analysis (i.e. consider a set of measurements
from 1000 situations under 2 conditions, and a t-test-like comparison
of each individually, i.e. the typical "gene expression problem" --
there you'd have one dataset and 1000 matrices, one for each gene, under
one possible computational scenario).

Remember that "copies" don't have to be copies, don't bring up the
space problem as a red herring here...
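
A small sketch of the one-to-many idea, in Python for illustration: a single dataset is walked gene by gene, and each per-gene comparison is built only when it is needed, so at most one small "matrix" exists at a time. The data values, the row layout (gene, condition, measurement), and the choice of Welch's t statistic are all assumptions for the example.

```python
# Hypothetical sketch: one dataset, many small per-gene comparisons, built lazily.
from math import sqrt
from statistics import mean, variance

# assumption: rows are (gene, condition, measurement); the values are made up
data = [
    ("g1", "A", 1.0), ("g1", "A", 2.0), ("g1", "A", 3.0),
    ("g1", "B", 2.0), ("g1", "B", 3.0), ("g1", "B", 4.0),
    ("g2", "A", 5.0), ("g2", "A", 5.5), ("g2", "A", 6.0),
    ("g2", "B", 5.0), ("g2", "B", 5.5), ("g2", "B", 6.0),
]


def per_gene_groups(data):
    """Yield (gene, condition A values, condition B values), one gene at a time."""
    genes = dict.fromkeys(gene for gene, _, _ in data)
    for gene in genes:
        a = [x for g, c, x in data if g == gene and c == "A"]
        b = [x for g, c, x in data if g == gene and c == "B"]
        yield gene, a, b


def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances allowed)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


t_stats = {gene: welch_t(a, b) for gene, a, b in per_gene_groups(data)}
```

Because `per_gene_groups` is a generator, the per-gene pieces are never all held in memory together; only the dataset and the resulting statistics persist.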

And methodologically, we agree -- large datasets aren't necessarily
the correct way to go in many problems, there are smaller copies that
are more appropriate and targeted, but I'm digressing into data
analysis strategies...

> Anyhow, I think that data frames are an OK way to manage data, and it
> would be good to have a common standard for that, but I would prefer if
> the data frame library was separate and the matrix/regression library
> could just build on that.

actually, just the opposite seems to be currently in place (building
dataframes as generalized matrices, as opposed to matrices as
specialized dataframes). Not sure why you think the other way is more
natural (and would be curious to know your line of thinking here?)

best,
-tony

blind...@gmail.com
Muttenz, Switzerland.
"Commit early,commit often, and commit in a repository from which we
can easily roll-back your mistakes" (AJR, 4Jan05).

Drink Coffee: Do stupid things faster with more energy!

Tamas Papp

Nov 4, 2012, 5:37:56 AM
to lisp...@googlegroups.com

On Sun, Nov 04 2012, A.J. Rossini <blind...@gmail.com> wrote:

> On Sun, Nov 4, 2012 at 10:38 AM, Tamas Papp <tkp...@gmail.com> wrote:
>>
>> Anyhow, I think that data frames are an OK way to manage data, and it
>> would be good to have a common standard for that, but I would prefer if
>> the data frame library was separate and the matrix/regression library
>> could just build on that.
>
> actually, just the opposite seems to be currently in place (building
> dataframes as generalized matrices, as opposed to matrices as
> specialized dataframes). Not sure why you think the other way is more
> natural (and would be curious to know your line of thinking here?)

I consider data frames and matrices two separate data structures, and I
do not think of one as a specialized version of the other.

For me, data frames are a collection of vectors of the same length.
These vectors (the "columns") of the data frame can be accessed by keys
("variable names"). The vectors may or may not be
specialized/restricted to a given type (numeric, float, double-float
etc), and may or may not be implemented as CL vectors (eg a vector of
vectors can be saved as a matrix, this can be done very neatly in CL
using displaced arrays).
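
As a rough Python analog of the displaced-array idea (not CL code, and not the library mentioned in this message), the sketch below keeps all values in one flat block of storage and exposes each column as a `memoryview` slice, so columns are accessed by key without copying. The column-major layout and the variable names are assumptions for illustration.

```python
# Hypothetical analog of CL displaced arrays: each column is a view into one
# flat block of storage, so no data is copied.
from array import array

n_rows = 3
# column-major storage: all heights first, then all weights
flat = array("d", [160.0, 143.0, 132.0, 50.4, 45.7, 30.0])
view = memoryview(flat)

# the "data frame": variable names mapped to equal-length column views
frame = {
    "height": view[0:n_rows],
    "weight": view[n_rows:2 * n_rows],
}

frame["weight"][0] = 51.0  # writes through to the shared storage
```

Here the same storage can be read either as two named vectors or as one block of doubles, which mirrors the remark that a vector of vectors can be saved as a matrix.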

I have a library that implements precisely this, which I intend to
release after the plotting library and the new version of
array-operations. Alas, some personal issues have kept me away from
work this week, but I hope I can get to some coding today.

Best,

Tamas

Mirko Vukovic

Nov 4, 2012, 9:55:19 AM
to lisp...@googlegroups.com


On Saturday, November 3, 2012 11:09:29 PM UTC-4, David Hodge wrote:
> Hi Mirko,
>
> Yes.
>
> Quite a lot actually, but needs to be cleaned up and functional things like summarise etc want to be written.
>
> look in src/data/dataframe.lisp
>
> I would be interested in seeing your code though.
>
> And, even more interested in understanding your gnu plot grammar of graphics - is that ready for external use, or more a private project?


Note: all references here are to the second edition of Wilkinson's Grammar of Graphics book.


I programmed a bit on grammar of graphics about a year ago (it is on github).  I then realized that I took the examples of Chapter 5 on the algebra too literally.  Also, I never really got the tables and diagrams on facets (Chapter 11 of the second edition).

For the back-end, I used my version of interface to gnuplot, which consists of several parts:
- raw interface to the gnuplot process
- library for generating gnuplot command strings
- a GoG<->interface which itself consists of two libraries, one generic, and one gnuplot specific

I never got to implementing a front-end, which would strive to be driver independent.

Even with my faulty GoG implementation, I found value in generating plots with it, and I am itching to clean it up.  With the above breakdown and some patience, I think one can proceed as follows:
* GoG syntax in Common Lisp
 - define and implement a syntax for GoG statements in CL

* GoG interpreter
 - Implement GoG interpreter that will translate some of GoG statements (such as data functions -- Section 3.1) into actual data objects

* GoG back-end(s)
 - back-end that will generate gnuplot or other driver commands

One does not need to implement the whole syntax before writing the interpreter, nor the whole interpreter before writing the back-end.  We could start with a very small subset of GoG to produce a simple plot, and then continue adding.
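
The three stages above can be sketched for the smallest possible subset, a single scatter plot. This is Python for illustration only; the spec format and the functions `interpret` and `gnuplot_backend` are invented here and do not reflect the existing GoG code. The generated text uses ordinary gnuplot commands with inline data read from `'-'` and terminated by `e`.

```python
# Hypothetical three-stage sketch: spec -> interpreter -> gnuplot back-end.
# The spec format and the function names are invented for illustration.

spec = {
    "data": {"height": [160, 143, 132], "weight": [50.4, 45.7, 30.0]},
    "x": "height",
    "y": "weight",
    "geom": "point",
}


def interpret(spec):
    """Resolve the variable names in the spec into actual (x, y) data pairs."""
    xs = spec["data"][spec["x"]]
    ys = spec["data"][spec["y"]]
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same length")
    return {
        "points": list(zip(xs, ys)),
        "xlabel": spec["x"],
        "ylabel": spec["y"],
        "geom": spec["geom"],
    }


def gnuplot_backend(plot):
    """Render an interpreted plot as a gnuplot script with inline data."""
    styles = {"point": "points", "line": "lines"}
    style = styles[plot["geom"]]
    xlabel, ylabel = plot["xlabel"], plot["ylabel"]
    lines = [
        f'set xlabel "{xlabel}"',
        f'set ylabel "{ylabel}"',
        f"plot '-' using 1:2 with {style} notitle",
    ]
    lines += [f"{x} {y}" for x, y in plot["points"]]
    lines.append("e")  # terminates the inline data block in gnuplot
    return "\n".join(lines)


script = gnuplot_backend(interpret(spec))
```

Only `gnuplot_backend` knows anything about gnuplot, so another driver could be added by writing a second back-end against the same interpreted structure.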

I hope to do work on GoG in the next few months along those lines.

Mirko

 

David Hodge

Nov 5, 2012, 12:20:45 AM
to lisp...@googlegroups.com
I am extremely interested in helping

I think ggplot in R is the bees knees

And something similar but faster in the CL world would be a great benefit.

David Hodge

Nov 5, 2012, 1:14:41 AM
to lisp...@googlegroups.com
Hi Tamas

On a different but related topic - right now CLS uses xarray. In xarray there should be a function called "take", which is called from print-object for instance.

I would just like to verify that take was renamed to "copy-as". If so, that actually explains everything and might warrant a push to Quicklisp?

Cheers

A.J. Rossini

Nov 5, 2012, 1:26:06 AM
to lisp...@googlegroups.com
I can confirm!

Sent from my iPod what's-a-ma-jiggy

Tamas Papp

Nov 5, 2012, 1:53:50 AM
to lisp...@googlegroups.com
Yes, if I remember correctly.

Then when I abandoned xarray & etc, I decided to use AS-ARRAY (the
generic function is defined in CL-NUM-UTILS). The idea is to convert an
object to a CL array if possible, so that you can interface with all
libraries that use plain vanilla CL arrays. I think that it would be a
good idea to have something like this in a numerical CL library.
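
The conversion idea can be illustrated with a Python analog of a generic function, using `functools.singledispatch`: one entry point, with a conversion method per container type, so that routines written for plain nested lists work on other structures. This is not the CL-NUM-UTILS API; `as_array` here, the toy `ColumnFrame` class, and `column_sums` are assumptions for the example.

```python
# Hypothetical analog of a generic AS-ARRAY: one entry point with a
# conversion method per container type. Not the CL-NUM-UTILS API.
from functools import singledispatch


class ColumnFrame:
    """Toy data frame: a dict of equal-length columns."""
    def __init__(self, columns):
        self.columns = columns


@singledispatch
def as_array(obj):
    raise TypeError(f"cannot convert {type(obj).__name__} to an array")


@as_array.register
def _(obj: list):
    # assume a list of rows; copy so the caller gets a plain nested list
    return [list(row) for row in obj]


@as_array.register
def _(obj: ColumnFrame):
    # transpose the columns into rows
    return [list(row) for row in zip(*obj.columns.values())]


def column_sums(obj):
    """A 'library' routine that only understands plain nested lists."""
    return [sum(col) for col in zip(*as_array(obj))]
```

Supporting a new container then means registering one more method, while `column_sums` and similar routines stay unchanged.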

Best,

Tamas

A.J. Rossini

Nov 5, 2012, 3:14:31 AM
to lisp...@googlegroups.com
+1 on this.

David Hodge

Nov 5, 2012, 3:28:30 AM
to lisp...@googlegroups.com
So the driver for this was a conversation the other day about data frames vs lisp-matrix, about which many useful pieces of knowledge were revealed.

Now that I have figured out what was happening with XARRAY, and before I actually start to use it for my summarise function, I wonder whether I should skip it and just use Antik.

Cons: Xarray has LOTS of references throughout the source code, though quite clearly not all of its facilities were being used.
            It's infrastructure work, which is not all that interesting (all this to write a nice little summarise function!)


Pros: Antik is probably the future target, and getting this done now cleans up the code base and removes a dependency on a deprecated library.

So, while I am waiting to hear people's various opinions on the topic, I will probably just write my routine anyway and then, depending on the vote, do the infrastructure piece.

So, the question du jour is - integrate Antik now or later?

A.J. Rossini

Nov 5, 2012, 3:42:16 AM
to lisp...@googlegroups.com
On Mon, Nov 5, 2012 at 9:28 AM, David Hodge <david...@gmail.com> wrote:

> Cons: Xarray has LOTS of references throughout the source code, though
> quite clearly not all of its facilities were being used
> It's infrastructure work which is not all that interesting (all
> this to write a nice little summarise function!)

Basically, xarray is just an API sitting on top of aref, for doing
things with arrays which would work in general with any rectangular
data structure for which a backend is available. Very simple scope,
just requires methods to be written for any new structure. It's
nothing more. The idea is that aref / mref / #ref become xref, and
that column binding, subsetting, etc, can be done using the same API
for any structure. So code readability is independent of the backend,
be it lisp-matrix, gsll, matlisp, etc...

What could be done would be to simply provide a backend for the
infrastructure provided through antik and gsll, so that the same code
can basically be used for both. At a performance cost, of course,
in exchange for readability and easier algorithmic implementation by
average people and statisticians.

> Pros: Antik is probably the future target and getting this done now cleans
> up the code base and removes a dependency on a deprecated library.

Antik is a full bit of numerical infrastructure. It has code for
various things (vectorization, etc) which are out of scope of xarray,
and is a kitchen-sink type package. But it's not overbuilt, and as I
mentioned, provides a decent starting point. BUT, what would you be
doing for access of array elements, and if we simply use an
xarray-style front end, would there be a huge loss?

So, we could use both, or we could just use Antik, assuming that it
plays well with other systems (which can also be adjusted to play well
with antik, this isn't a one-sided story).



> So, while I am waiting to hear people's various opinions on the topic, I will
> probably just write my routine anyway and then, depending on the vote, do the
> infrastructure piece

Exactly.

> So, the question du jour is - integrate Antik now or later?

Be precise in what you are asking -- do you want to give up or modify
a general purpose array-like API in favor of a specific one, or does
Antik also have a general purpose API we can simply edit xarray to
provide?

Optimisation in terms of speed and packages is still slightly
premature (i.e. think through what your point is, given the huge
differences in intent for the 2 packages, they do not share similar
goals).

A.J. Rossini

Nov 5, 2012, 3:43:56 AM
to lisp...@googlegroups.com
(quick summary, which was missing: Antik will be used, the question
is whether to rewrite xarray's API to use the Antik API or do the
opposite)
--

David Hodge

Nov 5, 2012, 3:48:41 AM
to lisp...@googlegroups.com

As far as I can see, Antik basically provides the same sort of interface as xarray.

It's mref and not xref, though….., allows slices and provides a cffi interface etc.

And, you are right, Antik is really the functionality of xarray plus a whole lot more, which may or may not be a welcome addition.

I had thought that there was a thought to head down that path, so I thought I would ask the question and see what came back….:)

For the moment, now that I have fixed xarray, I think it's best to leave the decision for later.

A.J. Rossini

Nov 5, 2012, 4:11:58 AM
to lisp...@googlegroups.com
(clarifications)

On Mon, Nov 5, 2012 at 9:48 AM, David Hodge <david...@gmail.com> wrote:
>
> As far as I can see, Antik basically provides the same sort of interface as xarray
>
> It's mref and not xref, though….., allows slices and provides a cffi interface etc.
>
> And, you are right, Antik is really functionality of xarray plus a whole lot more, which may or may not be a welcome addition.

There is a good deal there which IS a welcome addition.

> I had thought that there was a thought to head down that path, so I thought I would ask the question and see what came back….:)

There is a thought to head down that path, it's just a gap analysis
and a replacement (i.e. for example, getting lisp-matrix to play well
with Antik API, if we want that).

The challenge is that the dataframe stuff is built on lisp-matrix --
but I'm not opposed to rewriting stuff at this stage, if it makes
sense. In fact, I KNOW it needs a better rewrite at some point,
either incrementally or by throwing it completely out.

We need to enjoy and embrace the lack of a "backward-compatibility"
requirement, except for the XLispStat compatibility macros I need to
write, sometime :-).

Mirko Vukovic

Nov 5, 2012, 7:41:10 AM
to lisp...@googlegroups.com
Since Antik uses aref to access both foreign and CL arrays, and presuming that xarray
only uses aref (I have not looked at the code), xarray should live comfortably in antik.

Mirko Vukovic

Nov 5, 2012, 7:43:55 AM
to lisp...@googlegroups.com
I have compiled antik on clisp & cygwin and also on sbcl & linux.  Antik interacts
with gsl and other libraries as advertised.

I had to make some minor changes for clisp & cygwin which I listed on the antik
mailing list yesterday.  I hope that these changes will make it into the main antik branch.

Mirko Vukovic

Nov 5, 2012, 10:40:25 AM
to lisp...@googlegroups.com


On Monday, November 5, 2012 12:20:47 AM UTC-5, David Hodge wrote:
> I am extremely interested in helping

Cool.  Do you have Wilkinson's book handy?

> I think ggplot in R is the bees knees

Now, can you please translate that one for me :-)

A.J. Rossini

Nov 5, 2012, 4:28:22 PM
to lisp...@googlegroups.com
On Mon, Nov 5, 2012 at 4:40 PM, Mirko Vukovic <mirko....@gmail.com> wrote:
>
>
> On Monday, November 5, 2012 12:20:47 AM UTC-5, David Hodge wrote:
>>
>> I am extremely interested in helping
>
>
> Cool. Do you have Wilkinson's book handy?

I'm looking for my copy, might've walked off as I reduced my supply of
bookshelf dust catchers, I mean books...

>>
>>
>> I think ggplot in R is the bees knees
>
>
> Now, can you please translate that one for me :-)
>

"ggplot2 as implemented in R is extremely useful".

:-).

Anyway, I've got some feedback from Deepayan Sarkar regarding a new
approach that he's using for QT-based graphics (and widgets) for R.
If someone wants to hack his ideas into Common Lisp, I'm sure he'd be
happy to chat a bit. I do think that commonqt (or the pre-lim CLOS
version) is worth an eval as a GUI system, if we can provide some
handholding on the install for MacOSX and Microsoft Windows.

A.J. Rossini

Nov 5, 2012, 4:38:19 PM
to lisp...@googlegroups.com
On Mon, Nov 5, 2012 at 1:41 PM, Mirko Vukovic <mirko....@gmail.com> wrote:
> Since Antik uses aref to access both foreign and CL arrays, and presuming
> that xarray
> only uses aref (I have not looked at the code), xarray should live
> comfortably in antik.

xarray currently maps xref to (or has methods added which support):

aref for CL arrays
mref for lisp-matrix arrays
list-computations for list-of-list structures.... (eg 4th object in
2nd nested list...)
fall through to any of those 3 for dataframe-like's with the
corresponding data-stores (CL-array, listoflist, lisp-matrix)
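
A rough Python sketch of that dispatch, for illustration only: one `xref` accessor with a method per back-end, and a dataframe-like that falls through to whatever data store it wraps. The classes `FlatMatrix` and `DataFrameLike` are stand-ins invented here; they are not the xarray or lisp-matrix APIs.

```python
# Hypothetical sketch of an xref-style accessor; the classes stand in for the
# real back-ends and are not the xarray or lisp-matrix APIs.
from functools import singledispatch


class FlatMatrix:
    """Stand-in for a matrix back-end: row-major flat storage."""
    def __init__(self, nrows, ncols, values):
        self.nrows, self.ncols, self.values = nrows, ncols, list(values)

    def mref(self, i, j):
        return self.values[i * self.ncols + j]


class DataFrameLike:
    """Stand-in for a dataframe-like wrapping some data store."""
    def __init__(self, store, names):
        self.store, self.names = store, names


@singledispatch
def xref(obj, i, j):
    raise TypeError(f"no xref method for {type(obj).__name__}")


@xref.register
def _(obj: list, i, j):
    return obj[i][j]  # list-of-lists: j-th item of the i-th nested list


@xref.register
def _(obj: FlatMatrix, i, j):
    return obj.mref(i, j)


@xref.register
def _(obj: DataFrameLike, i, j):
    return xref(obj.store, i, j)  # fall through to the underlying store


def column(obj, j, nrows):
    """Back-end independent code: extract a column using only xref."""
    return [xref(obj, i, j) for i in range(nrows)]
```

Code such as `column` reads the same whichever store sits underneath, at the cost of one extra dispatch per element access.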

David Hodge

Nov 6, 2012, 6:30:26 AM
to lisp...@googlegroups.com
Mine is on order and I have been waiting for a while, the downside of being in Singapore.