Status, response to queries from Steve / David, and moving ahead

A.J. Rossini

Nov 25, 2013, 1:51:33 AM
to lisp-stat
Dear all -

I was in Seattle last week for a scientific advisory board meeting, and took advantage of the trip and the flight time to make a bit more progress. CLS, XARRAY, and LISTOFLIST all have some updates to documentation and code.

David recently messaged to say he had noticed some progress, and I'm happy to confirm that.

Example 04 (data munging, etc.) has made some progress and starts to demonstrate some basic data management approaches. I'm using it to fix XARRAY access and data conversion across lisp arrays, matrix-likes, dataframe-likes, and hopefully soon, GSLL's matrix structures. One thing I need to add is a set of "numeric to categorical data" exchange/transfer/swap functions, so that the purely numerical representations (GSLL, matrix-like from LISP-MATRIX) can interoperate and exchange data with the more general data storage (lisp arrays, dataframe-likes from CLS).
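A minimal sketch of what such a numeric/categorical exchange could look like, using only standard Common Lisp vectors. The function names CATEGORICAL->NUMERIC and NUMERIC->CATEGORICAL are hypothetical and are not part of CLS, XARRAY, or LISP-MATRIX; the coding scheme (integer codes in order of first appearance) is likewise an assumption.

```lisp
;; Hypothetical helpers for illustration only; not the CLS or XARRAY API.

(defun categorical->numeric (data)
  "Return two values: a vector of integer codes for DATA (a vector), and a
vector of the distinct levels in order of first appearance, so that code I
corresponds to level I."
  (let ((levels (make-array 0 :adjustable t :fill-pointer t))
        (codes (make-array (length data))))
    (loop for v across data
          for i from 0
          ;; VECTOR-PUSH-EXTEND returns the index of the newly added level.
          do (setf (aref codes i)
                   (or (position v levels :test #'equal)
                       (vector-push-extend v levels))))
    (values codes (coerce levels 'simple-vector))))

(defun numeric->categorical (codes levels)
  "Map each integer code in CODES back to its label in LEVELS."
  (map 'vector (lambda (code) (aref levels code)) codes))

;; Usage:
;; (categorical->numeric #("low" "high" "low" "mid"))
;;   => #(0 1 0 2), #("low" "high" "mid")
;; (numeric->categorical #(0 1 0 2) #("low" "high" "mid"))
;;   => #("low" "high" "low" "mid")
```

The codes vector is what a purely numerical structure could hold, while the levels vector would travel alongside it as metadata.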

There are a few missing chunks in the XARRAY API, namely the ability to simply write/store or serialize data. This is different from missing implementations of the XARRAY API: only lisp arrays are supposed to be complete, with LISTOFLIST, MATRIX-LIKE, DATAFRAME-LIKE, and the GSLL structures missing a few things.
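For plain lisp arrays, one possible starting point for write/store is the standard printer and reader. The sketch below is an assumption about how this might be approached, not an XARRAY feature; SERIALIZE-ARRAY and DESERIALIZE-ARRAY are hypothetical names, and it only covers general arrays whose elements print readably.

```lisp
;; Hypothetical helpers for illustration only; not part of XARRAY.

(defun serialize-array (array)
  "Return a string holding a readable printed representation of ARRAY."
  (with-standard-io-syntax
    (prin1-to-string array)))

(defun deserialize-array (string)
  "Read an array back from STRING as produced by SERIALIZE-ARRAY."
  (with-standard-io-syntax
    ;; Disable #. evaluation so reading stored data cannot run code.
    (let ((*read-eval* nil))
      (values (read-from-string string)))))

;; Usage: a round trip preserves dimensions and contents.
;; (deserialize-array
;;  (serialize-array (make-array '(2 2) :initial-contents '((1 2) (3 4)))))
;;   => #2A((1 2) (3 4))
```

The same two calls would work against a file stream opened with WITH-OPEN-FILE in place of a string.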

Steve N sent a quick query off-line, but I'd prefer to answer the general gist of his message here. It had two points:
1. query re: how would one replace missing values in a column in a dataframe
2. suggest to look into HADOOP

For the first comment/question: I'll put the query and example into examples/04-dataManipulation.lisp, but in the future, it would save me time if folks could add queries directly to the file as comments at the end, in the right "commented" section, and then I (or someone else) can implement them. So basically, check out or fork a copy, branch onto local-YOUR-UNIQUE-NAME, push back to GitHub, and send a pull request so we can share.
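Until that example exists, here is a minimal sketch of replacing missing values in one column, assuming the data is a plain list of row lists and that missing entries are marked with the keyword :NA. Both assumptions, and the names REPLACE-MISSING-IN-COLUMN and COLUMN-MEAN, are hypothetical; this is not the dataframe-like API in CLS.

```lisp
;; Hypothetical helpers for illustration only; not the CLS dataframe API.

(defun column-mean (rows column &key (missing :na))
  "Mean of the non-missing numeric entries in zero-based COLUMN of ROWS.
Signals a division-by-zero error if every entry is missing."
  (let ((present (remove missing
                         (mapcar (lambda (row) (nth column row)) rows))))
    (/ (reduce #'+ present) (length present))))

(defun replace-missing-in-column (rows column replacement &key (missing :na))
  "Return a fresh copy of ROWS in which every entry of zero-based COLUMN
that is EQL to MISSING is replaced by REPLACEMENT."
  (mapcar (lambda (row)
            (let ((new-row (copy-list row)))
              (when (eql (nth column new-row) missing)
                (setf (nth column new-row) replacement))
              new-row))
          rows))

;; Usage: impute the column mean into column 0.
;; (let ((rows '((1 "a") (:na "b") (3 "c"))))
;;   (replace-missing-in-column rows 0 (column-mean rows 0)))
;;   => ((1 "a") (2 "b") (3 "c"))
```

The original rows are left untouched, so the imputed and raw versions can be compared side by side.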

Specifically, Steve suggested looking at one of the large public databases, which is a good idea, since I can use the getting-data section (I will probably write an examples/03-gettingWWWdata.lisp file) to demonstrate how to fetch data and make it accessible using CLS. I will probably use a smaller public-use dataset (something on the order of 1-10 MB, not 100 MB or GB) so that the example files do not take forever.

For the second, HADOOP, there is clearly an equivalence of interfaces: Common Lisp had map-reduce strategies eons ago, though not the parallelisation across machines. But for the specific issue, we'd need a tie-in to the systems, and I've not got the bandwidth right now to do it. Then again, it ought to be a simple matter of using LPARALLEL (which does have such structures) to do the dispatch to the lower-level HADOOP infrastructure. That claim is in fact a throw-away, based on limited reading of the APIs and literature for those two systems. I haven't looked at feasibility.
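To illustrate the map-reduce point with nothing but standard Common Lisp (no LPARALLEL, no HADOOP), here is a small sketch. The function names are hypothetical, and the chunked version only mimics on one machine how work might be split before being dispatched elsewhere.

```lisp
;; Illustration only; sequential, standard Common Lisp.

(defun sum-of-squares (numbers)
  "Map step: square each number. Reduce step: add up the squares."
  (reduce #'+ (mapcar (lambda (x) (* x x)) numbers)))

(defun chunked-sum-of-squares (chunks)
  "Apply SUM-OF-SQUARES to each chunk (a list of numbers), then combine the
partial results. Each chunk could in principle be handled by a separate
worker, since the partial sums are independent."
  (reduce #'+ (mapcar #'sum-of-squares chunks)))

;; Usage:
;; (sum-of-squares '(1 2 3 4))                  => 30
;; (chunked-sum-of-squares '((1 2) (3 4)))      => 30
```

Because addition is associative, the chunked and unchunked results agree, which is the property any parallel dispatch would rely on.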

As always, things are moving, faster right now, probably with a December slowdown as work piles up, then some coding time between Christmas and New Year's when I'm in the mountains, and hopefully some train-based coding time as I try to spend weekends in the mountains this year.

best,
-tony

blind...@gmail.com
Muttenz, Switzerland.
"Commit early,commit often, and commit in a repository from which we can easily roll-back your mistakes" (AJR, 4Jan05).

Drink Coffee:  Do stupid things faster with more energy!