On Mon, Jan 07 2013, Mirko Vukovic <mirko....@gmail.com> wrote:
> On Monday, January 7, 2013 10:29:25 AM UTC-5, Tamas Papp wrote:
>
>> My library is very similar to R's data-frame, basically it is a list of
>> vectors with some syntactic sugar and optimizations, that plays well
>> with the recently released cl-slice and the yet unannounced (but
>> released) data-omnivore (that reads from CSV and supports all kinds of
>> wacky decimal number formats).
>>
>
> I use a vector of vectors. I see room for unification here.
Possibly. I think that your library is a bit more complex than I need
at this point. Currently mine is less than 350 LOC. There are three
reasons for this:

1. I let CL-SLICE do the heavy lifting (for slices).
2. Columns take care of themselves: you only need to define three
   methods for a column implementation (length, summary, slice) and
   that's it. Currently only vectors can be columns, but it would be
   trivial to extend this to other kinds, e.g. arrays as a column of
   subarrays.
3. I keep everything trivial. No super-generic API, no schemas, no
   bells and whistles. It is useful to think about my data frames as
   not much more than overglorified plists/alists, with a bit of sugar
   (or preferably xylitol). Or as a saner re-implementation of R's data
   frame; I am not aiming for much more than that (apart from the
   clever SLICE).
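To make the column protocol in point 2 concrete, here is a minimal
sketch of the idea in plain Common Lisp. The generic function names
(COLUMN-LENGTH, COLUMN-SUMMARY, COLUMN-SLICE) are made up for
illustration and are not the library's actual API; only the vector case
is shown.

```lisp
;;; Hypothetical names throughout; this is NOT the cl-data-frame API.

(defgeneric column-length (column)
  (:documentation "Number of elements in COLUMN."))

(defgeneric column-summary (column)
  (:documentation "A short, human-readable description of COLUMN."))

(defgeneric column-slice (column indexes)
  (:documentation "Elements of COLUMN at INDEXES, a sequence of integers."))

;; The only column kind in this sketch: ordinary vectors.
(defmethod column-length ((column vector))
  (length column))

(defmethod column-summary ((column vector))
  (format nil "vector of length ~D" (length column)))

(defmethod column-slice ((column vector) indexes)
  ;; Collect the selected elements into a fresh vector.
  (map 'vector (lambda (i) (aref column i)) indexes))
```

Supporting another kind of column would then amount to adding three
more methods specialized on that type.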
>> CL-USER> (use-package '(#:cl-data-frame #:cl-slice))
>> T
>> CL-USER> (defparameter *d* (data-frame :gender #(male male female)
>>                                       :age #(30 31 32)))
>> *D*
>> CL-USER> (slice *d* t :gender)
>> #(MALE MALE FEMALE)
>> CL-USER> (slice *d* #(0 1) :age)
>> #(30 31)
>>
>>
> Can you do a slice across columns, returning say #(male 31)?
(SLICE df row-index t) would do that, but that's a corner case I haven't
handled yet because I haven't decided what it should return, as it
drops a dimension. Probably it will return a plist, (:gender male :age
30), or an alist, ((:gender . male) (:age . 30)).
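For illustration, here is how the two candidate row representations
could be built from a list of column vectors. This is a sketch in plain
Common Lisp, not what SLICE does: ROW-AS-PLIST and ROW-AS-ALIST are
hypothetical helpers, and keys and columns are passed as parallel lists.

```lisp
;;; Hypothetical helpers; not part of cl-data-frame or cl-slice.

(defun row-as-plist (keys columns row-index)
  "Return row ROW-INDEX as a plist. KEYS is a list of keys and COLUMNS
a list of vectors, in the same order."
  (loop for key in keys
        for column in columns
        collect key
        collect (aref column row-index)))

(defun row-as-alist (keys columns row-index)
  "Return row ROW-INDEX as an alist of (KEY . VALUE) pairs."
  (loop for key in keys
        for column in columns
        collect (cons key (aref column row-index))))

(row-as-plist '(:gender :age) '(#(male male female) #(30 31 32)) 0)
;; => (:GENDER MALE :AGE 30)

(row-as-alist '(:gender :age) '(#(male male female) #(30 31 32)) 0)
;; => ((:GENDER . MALE) (:AGE . 30))
```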
> One issue that I had to deal with in my table queries is that while Seibel's
> code passes a whole row (his data structure was a vector of lists, each
> list being a record), I pass the row index. Then I would need to have
> access to the table, in order to get access to row elements (aref (aref
> table column-index) row-index)
>
> What would be nice to have (and maybe your code provides for it already) is
> something along the following lines of pseudo-code
>
> CL> (setf data-table (make-nested-vector 5 10))
> CL> (setf (aref! data-table i j) x) ;; and so on to fill up the data table
> CL> (aref! data-table 3 4) ;; to recover data
>
> CL> (setf row (row-slice data-table row-index)) ;; define a row-slice
> ;; object with its accessor
> ;;; now magic happens in the background
>
> CL> (aref! row i) ;; we can access individual row elements
>
> aref! knows how to deal with vectors of vectors and row-slices. The
> row-slice actually contains a pointer to data-table.
My data frames can handle the first part, with something like
(setf *df* (data-frame :a column :b column ....))
; different creation syntax, need to name columns
(setf (slice *df* i j) x) ; set a single element -- NOT IMPLEMENTED YET
(slice *df* i j) ; get it back
(setf row (slice *df* row-index t)) ; T selects all columns
But then I haven't decided what ROW is: should it be a plist, an
alist, or a separate kind of structure? The first two are more
transparent, but would slice work on them seamlessly? (It is not
trivial to distinguish lists, alists and plists.)
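A small sketch of why the distinction is hard: the predicates below are
hypothetical heuristics (not part of any library) that only check
shape, and plain lists pass them just as well. They assume proper
lists.

```lisp
;;; Hypothetical shape checks; they assume proper (non-dotted) lists.

(defun plausible-plist-p (list)
  "True if LIST has even length and a symbol in every key position."
  (and (listp list)
       (evenp (length list))
       (loop for key in list by #'cddr
             always (symbolp key))))

(defun plausible-alist-p (list)
  "True if every element of LIST is a cons."
  (and (listp list)
       (every #'consp list)))

(plausible-plist-p '(:gender male :age 30))   ; true, a real plist
(plausible-plist-p '(male female))            ; also true, but just two symbols
(plausible-alist-p '((:gender . male) (:age . 30))) ; true, a real alist
(plausible-alist-p '((1 2) (3 4)))            ; also true, but just a list of lists
```

So a generic function that receives a list cannot tell from its shape
alone which of the three it was given, which is one argument for a
separate row structure.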
> BTW, did you notice one of my examples where I select population data from
> year 1650 until population reaches 10^8 using
>
> (defparameter *table-1800->1e8*
>   (select *raw-table*
>           :where (matching-rows *raw-table*
>                                 (list 'year 1800 #'>=)
>                                 (list 'population 100000000 #'<=))))
In my data-frame implementation, you select using bit-vectors and
slice. So you would do it with
(slice *data-table*
       (select-rows *data-table* '(year population)
                    (lambda (year population)
                      (and (<= 1800 year) (<= population 100000000)))))
There is some syntactic sugar for binding the values of each column to
variables; the macros that do this are named with a gerund (-ing):
(slice *data-table*
       (selecting-rows (*data-table*
                        (year year)
                        (population population))
         (and (<= 1800 year) (<= population 100000000))))
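The idea of selecting with bit vectors can be sketched without the
library. ROW-MASK and SELECT-BY-MASK below are hypothetical helpers,
not part of cl-data-frame or cl-slice, and the numbers are made up for
the example.

```lisp
;;; Hypothetical helpers illustrating selection by bit vector.

(defun row-mask (predicate &rest columns)
  "Return a bit vector with a 1 for every row where PREDICATE, called
with the corresponding elements of COLUMNS, returns true. All COLUMNS
are assumed to be vectors of the same length."
  (let ((mask (make-array (length (first columns)) :element-type 'bit)))
    (apply #'map-into mask
           (lambda (&rest values)
             (if (apply predicate values) 1 0))
           columns)))

(defun select-by-mask (column mask)
  "Return a vector of the elements of COLUMN whose MASK bit is 1."
  (coerce (loop for element across column
                for bit across mask
                when (= bit 1)
                  collect element)
          'vector))

;; Made-up illustrative data, not real population figures.
(defparameter *years* #(1750 1800 1850 1900))
(defparameter *population* #(10 50 120 300))

(defparameter *mask*
  (row-mask (lambda (year population)
              (and (<= 1800 year) (<= population 100)))
            *years* *population*))
;; *mask* => #*0100

(select-by-mask *years* *mask*)
;; => #(1800)
```

The mask is computed once and can then be applied to any number of
columns, which is what makes it a convenient row selector.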
Best,
Tamas