Julia


Steven Núñez

Oct 31, 2012, 7:28:04 PM
to lisp...@googlegroups.com
Anyone seen Julia? Might be worth having a closer look as we try to get CLS off the ground. They have (I assume) been thinking about some of the same issues we have, and appear to have a graphics display of some sort already. I especially like the "Designed for Parallelism & Cloud Computing" angle, which, despite being laden with marketing hype, would prove quite useful when competing against other systems. It would be interesting to see how they implemented the web REPL and produced the graphs on this page.

Regards,
- Steve

A.J. Rossini

Oct 31, 2012, 7:31:42 PM
to lisp...@googlegroups.com
I really want the lisp paradigm. Julia and similar are, in the
immortal paraphrase of JWZ, "things trying to become lisp but will
fail".

Anyway, agreed that we need to start having examples that work, and
clean them up into a reusable system. Remember, I designed a good number
of the initial parallel systems for R.... but the "cloud" thing,
well, that's pure middleware hype. BUT, we need to have a well-doc'd
API that could be dropped as appropriate into mod_lisp or a similar
ORB-style broker....



--
best,
-tony

blind...@gmail.com
Muttenz, Switzerland.
"Commit early,commit often, and commit in a repository from which we
can easily roll-back your mistakes" (AJR, 4Jan05).

Drink Coffee: Do stupid things faster with more energy!

Steven Núñez

Oct 31, 2012, 8:52:57 PM
to lisp...@googlegroups.com
I agree, I want the lisp paradigm as well, mostly because it's better to
add stats capability to a language that you can build real production
systems out of than to build a special-purpose language that's only
useful for research.

I only mentioned it because they seem to be tackling some of the same
problems we are, and it might be useful to see what technologies they
settled on and why, for example in visualization.

The question of whether or not we can, right now, recreate the series of
blog posts was met with overwhelming silence, so let me ask again:

Can we? If not, what's left to do? What state is the data frames stuff in?
How well documented is it? If the entire series is too much to tackle
right now, let's start small and where it's useful: data munging.

- Steve

Steven Núñez

Oct 31, 2012, 9:07:49 PM
to lisp...@googlegroups.com
On 2012-11-01 10:31, "A.J. Rossini" <blind...@gmail.com> wrote:


>but the "cloud" thing, well, that's pure middleware hype.

I just caught this statement (I should have read more carefully -- that's
the trouble with answering emails in real-time whilst at work. I really
should learn how to 'batch' email processing).

Anyway, I have to disagree with the statement that it's 'middleware hype'.
Perhaps the term is -- it's certainly been over-used, and seems to be the
equivalent of adding a '.com' to a company name in the early 2000s -- but
the use of infrastructure you lease and don't have to build or buy is
directly relevant to some of the things we (well, I) need to do,
especially for companies and individuals that aren't enterprise-class.

For example, Amazon has a 'storage gateway', among other products, that
allows a hybrid caching storage scheme for data that's partly on their
disks, partly local. I don't know about the rest of you, but setting up 10
petabytes of reliable storage to hold datasets isn't something I could
easily do myself. Likewise, I can rent, by the hour, a whole range of
machines that would cost me a fortune to build, such as the Cluster
Compute Eight Extra Large instance:

Cluster Compute Eight Extra Large Instance
60.5 GB of memory
88 EC2 Compute Units (2 x Intel Xeon E5-2670, eight-core "Sandy Bridge"
architecture)
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
EBS-Optimized Available: No*
API name: cc2.8xlarge


I think that scoring a few runs on the board in this space will definitely
raise our visibility.

- Steve

David Hodge

Oct 31, 2012, 9:08:49 PM
to lisp...@googlegroups.com
I also really want the lisp paradigm - otherwise I would be using R or Octave right now (well, actually I am, but reluctantly).

And once the basic infra is decided, building a "web REPL" and using a layer like D3.js would not be all that hard (at least in theory).

But I like your challenge, and I can certainly do the first part of the data munging stuff in lisp right now, though the workflow is not quite as smooth yet. The sticking point is the models, which I only just started looking at seriously last night, and frankly it's a lot to grok. One thing that's clearly lacking is the ability to filter rows etc., though that should be straightforward to put in.

Possibly I can convince Tony to get on a Skype call over the weekend and discuss that in a little detail, as I am sure he has a lot of the details in his head, and it would be good to make f2f contact, even if it's via electrons!

To this point we are all talking about a "lispy DSL", but I wonder what exactly that would look like.

Part of that DSL has to be a way of importing and filtering data. And, it needs to be relatively succinct for obvious reasons. So any thoughts in that direction that you could share would be good. I will try to write up a proposal in the next couple of days for comment.

Oh, and if we go anywhere near an ORB/RMI that sort of stuff then I will run screaming. Such things have taken years off my life :)

David Hodge

Oct 31, 2012, 9:24:44 PM
to lisp...@googlegroups.com
Hey - I work for a major software vendor. Cloud is largely hype; we have been doing it for years, just the marketing teams have climbed on board :)

However, being able to leverage Amazon/Heroku/Rackspace/Google etc. is a brilliant thing. A longer-term plan for CLS, for large datasets, would be to make sure that we could run it with Hadoop streams, for instance (not hard at all, really).

So I absolutely agree it's something that should be in our roadmap. And once we have our basic functionality in place, it's not going to be hard to do (I have done the obligatory Heroku app, AWS, EC2 explorations using lisp already. It's way cool).

Here is a thought:
1. Let's agree to get CLS to a point where we can respond to your "Firm Target".
2. In order to do that, we need to start assigning some tasks and thinking about workflow.
3. There is data munging, visualisation, lisp DSL definition, models, unit tests, integration of GSL, and documentation & examples, to name a few of the tasks that come to mind.

Thoughts? Comments? Snorts of disbelief? Do share!

Steven Núñez

Oct 31, 2012, 11:53:33 PM
to lisp...@googlegroups.com
OK, it seems that data munging is the place to start:

1. The majority of time building something useful is spent munging data.
2. It's a prerequisite for any later model building.
3. There are (at least) three people interested in data munging with CLS.

Let's start with the first step in the 'Firm Target' and knock the items off one by one. Here's the R equivalent of the first thing we want to do with data munging:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> nrow(iris)
[1] 150
> table(iris$Species)

    setosa versicolor  virginica 
        50         50         50

Can we do this now and tick off the box? It looks rather simple: load a data-frame, determine the number of rows and create a table from the species column.

If not, how far away is CLS from being able to do this? Mostly that's going to have to be answered by Tony. If we're not here yet, then we have our starting point. I'm willing to jump in and help get us to this point, but I think Tony (or whoever 'owns' data frames) will need to drive it by assigning tasks. Don't forget too, documentation is nearly as important as code (and I'm happy to do that too).

Regards,
    - Steve

Mirko Vukovic

Nov 1, 2012, 12:58:18 AM
to lisp...@googlegroups.com, steven...@illation.com


On Wednesday, October 31, 2012 11:53:36 PM UTC-4, Steven Núñez wrote:

> Can we do this now and tick off the box? It looks rather simple: load a
> data-frame, determine the number of rows and create a table from the
> species column.



Maybe I misunderstand this bit of R, but Chapter 27 of Practical Common Lisp may hold the answer to your needs. It has to do with an in-memory MP3 database: selecting or removing rows, sorting them, inserting new ones. The techniques in that chapter (it took me a bit to wrap my head around them) may be what you are looking for.

hth,

Mirko
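
For reference, the core of that chapter's technique is a list of plist
rows queried with predicate closures. A minimal sketch in that spirit
(the iris-style fields are purely illustrative, and this is not CLS code):

(defvar *db* '((:species "setosa"    :sepal-length 5.1)
               (:species "virginica" :sepal-length 6.3)))

(defun where (&rest clauses)
  "Return a predicate matching rows whose fields EQUAL the given values."
  (lambda (row)
    (loop for (key value) on clauses by #'cddr
          always (equal (getf row key) value))))

(defun select (pred)
  "Return the rows of *DB* satisfying PRED."
  (remove-if-not pred *db*))

;; (select (where :species "setosa"))
;; => ((:SPECIES "setosa" :SEPAL-LENGTH 5.1))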

David Hodge

Nov 1, 2012, 4:11:48 AM
to lisp...@googlegroups.com, steven...@illation.com
So, getting the iris dataset in is:

1. (setf *IRIS* (filename.dsv->dataframe "/Users/dbh/Downloads/iris.data"))

……….

2. (nrows *IRIS*)
149

3. (dfhead *IRIS*)
0: 4.9 3.0 1.4 0.2 IRIS-SETOSA 
1: 4.7 3.2 1.3 0.2 IRIS-SETOSA 
2: 4.6 3.1 1.5 0.2 IRIS-SETOSA 
3: 5.0 3.6 1.4 0.2 IRIS-SETOSA 
4: 5.4 3.9 1.7 0.4 IRIS-SETOSA 
5: 4.6 3.4 1.4 0.3 IRIS-SETOSA 
6: 5.0 3.4 1.5 0.2 IRIS-SETOSA 
7: 4.4 2.9 1.4 0.2 IRIS-SETOSA 
8: 4.9 3.1 1.5 0.1 IRIS-SETOSA 
9: 5.4 3.7 1.5 0.2 IRIS-SETOSA 

4. I will have to think about doing table; should not be too hard.

So 75% of the way there.

David Hodge

Nov 1, 2012, 9:32:04 AM
to lisp...@googlegroups.com, steven...@illation.com
And this is the reason why I like lisp so much.

After a few minutes of coding, while I was listening to a con call:

4. (dfgroupby *IRIS* 4)
IRIS-SETOSA => 50
IRIS-VERSICOLOR => 50
IRIS-VIRGINICA => 50

Not quite the same as R's table, but with not much more work it could be made so. One thing I will do is allow variable references, so we can say
(dfgroupby *IRIS* :species) or some such.
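
For the curious, the counting idiom behind a TABLE-style group-by is a
small hash-table walk. A sketch independent of CLS internals (COUNT-LEVELS
is a made-up name, and the column is assumed to arrive as a plain list):

(defun count-levels (values)
  "Return an alist of (level . count) for the items in VALUES."
  (let ((counts (make-hash-table :test #'equal)))
    (dolist (v values)
      (incf (gethash v counts 0)))
    (let (result)
      (maphash (lambda (k n) (push (cons k n) result)) counts)
      ;; Sort levels by their printed names for stable display.
      (sort result #'string< :key (lambda (c) (princ-to-string (car c)))))))

;; (count-levels '(a a b c c c)) => ((A . 2) (B . 1) (C . 3))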

Oh, and there is something below where nrows reported 149. 
My bad for cutting and pasting the wrong part of the session.

CLS> (nrows *IRIS*)
150

Now what's missing from a dataframe at the moment is a way of selecting rows based on some sort of query. That has to be fixed.
But the point of the exercise is to show that we are not that far off something that's usable, IMO.

Cheers

Steven Núñez

Nov 1, 2012, 7:55:51 PM
to lisp...@googlegroups.com
Good stuff Dave!

Just wondering though: is there a reason the CLS syntax varies from R? It seems that unless there's some good reason, it will make life easier (and possibly make an R data-munger converter easier) if we stick to the R syntax, especially since data-frames are new to CLS.

Regards,
- Steve

David Hodge

Nov 1, 2012, 8:38:37 PM
to lisp...@googlegroups.com
Hey Steve,

The CLS syntax will vary from R in that it is not R :)

And, frankly, I would not want it to be. 

I guess the point of me whipping those things up was to show that CLS is not too far away from being really usable with a little work, at least for data munging. A proper tutorial and documentation would help someone coming from R, though, and that's important - and on the to-do list. However, part of the reason, for me at least, to use CLS and not R is not only the scalability and performance aspect, but the fact that CLS sits within a really capable programming environment where I do not have to do unnatural things to get stuff done. For me, I quite dislike R syntax and the myriad little edge cases you just have to know.


The major missing ingredient is our "lispy DSL" for statistics, of which data handling & persistence is a part, followed by having complete and easy-to-use summary statistics. So I happen to agree with you that filename.dsv->dataframe does not fall trippingly off the tongue (or fingers, for that matter), and getting good, easy-to-remember names is important.

Now it turns out that there has been some thinking about this in the past; viz., look in the src/data directory and src/describe.

What has to happen is these thoughts need to get carried through.

One thing that is clearly missing is "missing data". And before we get much further, we need to agree what to do about it, otherwise our lives will be a misery.

One could certainly steal R's approach, though I am not sure exactly what impact that would have on data frames etc., which is probably my next area of investigation.

Tony - if you are reading this, let me know. It would be good to get your thoughts.

(And if this discussion goes on, let's change the title of the thread - it's not about Julia after all!)

Steven Núñez

Nov 1, 2012, 10:48:37 PM
to lisp...@googlegroups.com
I guess the only point was that, IMHO, we shouldn't make things different just for the sake of being different. I.e., if R has a function called 'table' that we're reproducing (since we're lifting much of the idea of data-frames from R), there's nothing to be gained by calling it 'dftable', unless of course it conflicts with something in another package. This will ease transitions for people coming from R and make writing a DSL for processing R scripts a bit easier. I think you are probably trending down this route anyway.

Function names are a small thing at this point, but worth getting consensus on before we get too far down the track, because changing them later gets exponentially harder. My vote is to stick to R where it's helpful and diverge where it makes sense. R won the war; we can only gain by leveraging its ubiquity, especially in areas that are new for CLS.

Now, on to the more interesting topic: data munging. It does seem like we're not too far away. How are we doing from a 'build engineering' and 'steering committee' perspective? You gave a good example below with filename.dsv->dataframe. This brings up two points we should address without, hopefully, a great deal of discussion:
  • How do we agree on, and quickly and easily move into CLS, the convention/idiom/syntax for new constructs?
  • Once we have a new construct/bug fix/improvement, how can we quickly move it into a master repository? How do we move it to Quicklisp?
Related questions on infrastructure:
  • Where should bugs be reported?
  • Do we have a continuous build environment?
  • Where is the documentation being hosted?
  • Is there a wiki?
There are free versions of these developer tools available from Atlassian (well, free for open-source projects). Where are we in the boring old task list of items that make software work? We've used Bitbucket in the past and they have some rudimentary tools; I think you get the commercial ones upon application as an open-source project.

David Hodge

Nov 1, 2012, 11:10:20 PM
to lisp...@googlegroups.com
Hi Steve,

Thanks for your thought-provoking comment!

The thing to consider is that we have an existing code base. Right now the data frame package uses df(something-or-other) as a convention, though that is not totally consistently applied.
So while I am not particularly stuck on names, there will be a cost in changing everything around to match a somewhat different paradigm in R - which lacks consistency as well. YMMV.

I don't know if we are in competition with R, or trying to get R users to adopt CLS at some stage in the future in an OSS marketing program. If the platform we eventually build is worthy enough, it will gain users, but I am not sure that it's my goal to be able to run R scripts - though it's an interesting thought.

We need to codify our lisp DSL ASAP, I guess, and that can then drive quite a lot of activity. Right now I am writing a proposal for file import based on the existing import structure - basically (import filename :filetype), where filetype can be csv, structured, sql or whatever. We need to be able to either extract or supply variable names (or factors, in R speak) and handle missing data. Most of the mechanics are there for this right now; they just need to be consolidated and regularised.
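
For concreteness, dispatching an importer on a file-type keyword is a
natural fit for EQL specializers. A sketch under the proposal above
(DATA-IMPORT and SPLIT-ON-COMMA are hypothetical names, chosen because
IMPORT is already a standard CL function; the CSV handling is naive):

(defgeneric data-import (filename filetype)
  (:documentation "Read FILENAME into row data according to FILETYPE."))

(defmethod data-import (filename (filetype (eql :csv)))
  ;; Collect each line as a list of comma-separated fields.
  (with-open-file (in filename)
    (loop for line = (read-line in nil)
          while line
          collect (split-on-comma line))))

(defun split-on-comma (line)
  "Naive comma split; no quoting or escaping rules."
  (loop for start = 0 then (1+ pos)
        for pos = (position #\, line :start start)
        collect (subseq line start (or pos (length line)))
        while pos))

Other filetypes would then get their own (eql :sql) etc. methods without
touching any callers.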

As for software engineering: right now it's Git and GitHub. Tony has the master repo. We can report bugs, build a wiki and all that stuff there. I see no compelling reason to change at the moment. In order to get a continuous build environment, we need to get the unit tests sorted out as a priority, IMO.

Cheers

Steven Núñez

Nov 1, 2012, 11:14:55 PM
to lisp...@googlegroups.com
Fantastic.

Tony, can you organize logins to all of these tools for us, set up the bug-tracking system, wiki structure, etc.? Once the housekeeping is out of the way, it will be a lot easier to move forward as a team.

David Hodge

Nov 1, 2012, 11:20:28 PM
to lisp...@googlegroups.com
www.github.com/blindglobe/common-lisp-stat.

Issue tracking etc. is already there.

Just need a wiki.

Steven Núñez

Nov 1, 2012, 11:26:53 PM
to lisp...@googlegroups.com
I think we'll need a bit more than the out-of-the-box setup to be useful. The wiki needs structure; bug tracking needs modules, trends, etc. I find that linking an issue to the source repository, so you can see the fix, is quite useful for code reviews and for understanding source.

Then it's just continuous builds and a documentation system.

Regards,
- Steve


A.J. Rossini

Nov 2, 2012, 1:24:01 AM
to lisp...@googlegroups.com
Sorry for being out for the last few days. Family issues; I'm a
single parent, still resolving issues with my late wife's inheritance,
etc...

At this point, I'd rather see code and examples than worry about infrastructure.

Set up a github account, clone the repo, download locally, and start
playing in the examples directory, which is where I've been doing the
"blog" stuff, using either org-mode as a literate programming tool or
just lisp-plus-comments.

The general principle:
1. write what you think you want to write, and then
2. write what you need to write,
3. and then we'll gap-analysis the difference.

So for example, I'm trying to get the t-test working :-). I've got
code (not checked in) that does it, but it's #2, not yet #1.

For tables -- #2 looks like what David put together, but what do we
want for #1? Well, we need a table- or tabular- or
cross-classification class (CC-class), which represents counts and
is completely different from a dataframe. Ideally, in that CC-class
structure, we want to store how it got there (i.e. if from a file, a
quick metadata record; if from a data-frame, etc...), so that we can
audit the path.

This will be important later, see comments on reproducibility when I
wax philosophical in the source code.
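
A hypothetical sketch of such a CC-class -- counts plus the provenance
needed to audit the path (none of these names are existing CLS code):

(defclass cross-classification ()
  ((counts :initarg :counts :reader cc-counts)   ; e.g. alist of (level . count)
   (source :initarg :source :reader cc-source))  ; e.g. (:file "iris.data")
  (:documentation "Counts by level, tagged with how they were derived."))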

For missing data -- the same. We need extensions of the numbers to
include infinity, missingness-categories, as well as nominal and
ordinal categorical variable storage structures.

Again, the TODO.org file in the main directory needs to be updated
with these tasks, and I'll see about this during my afternoon coffee
break...

Check in "experiments in how we could do things" into the examples
directory, and just make sure (if possible, if you want) to
distinguish between #1 (done right) and #2 (done right now...).

If you are looking at ways to contribute, and these will get dumped
into TODO.org later:

1. look at Tamas' package which includes infinity, and think about
how to add a few objects which represent various missingness states
(just do a single one, and then we can get the whole semi-hierarchical
family of categories: MAR, MCAR, CAR, non-ignorable, etc...)

2. put David's code for generating declt output into the doc directory
with a makefile so others can work on modifying docs based on the
resulting output. I've got a quick hack for making it work in
quicklisp's local directory, which I'll share and put into the
documentation subdirectory as a note

3. convince Tamas to release his graphics package and see if Mirko's
grammar of graphics is in the right direction, and see if they can be
pieced together as an example (okay, this is huge)

4. figure out how to enforce typing in dataframe columns (it should be
an error if someone adds/mods a value which is not the right type; see
the sketch after this list). Figure out how to type the various data
forms (comp-sci types, statistical types, the overlap)

5. write examples, and get'em working so we can refactor and migrate
code into the code base...
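
On item 4, a hypothetical sketch of a column that checks element types on
write (COLUMN and COLUMN-REF are made-up names, not CLS API; the DATA slot
is assumed to hold a vector):

(defclass column ()
  ((element-type :initarg :element-type :reader column-element-type)
   (data :initarg :data :accessor column-data)))

(defun (setf column-ref) (value column index)
  "Store VALUE at INDEX, signalling an error on a type mismatch."
  (unless (typep value (column-element-type column))
    (error "~S is not of column type ~S."
           value (column-element-type column)))
  (setf (aref (column-data column) index) value))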



I'll start a new thread on the various topics later, maybe tomorrow.

A.J. Rossini

Nov 2, 2012, 1:30:24 AM
to lisp...@googlegroups.com
Two things:

one, I'm fine with filename.dsv->dataframe names for functions, and
encourage them. The base system should be communicative and clean to
read, even if a bit verbose and upsetting to those who like abbreviations.

two, since people WILL HATE THEM (see above), I'd like to see macros
written that describe proposals for cleaner, terser syntax (that we
can drop in as "dialects"), but please leave the base system a bit
verbose and non-abbreviated. Lots of people are bringing in their own
backgrounds, and I'd prefer to write macros on top that match them
than to target them.

So then, you just add the macro package of your choice to get a terser
(or R-like, or whatever...) syntax.

The machines are fast and cheap, but brainpower and paradigm shifts
are really expensive; better to show people how to make the system do
what they want (and leverage the power of macros) than to try to get it
right for one data analysis domain (finance, clinical trials,
bioinformatics, environmental, etc...)
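
As a sketch of what such a "dialect" layer could look like: terse aliases
defined as macros over the verbose base names discussed in this thread
(DEFINE-DIALECT-ALIAS itself is hypothetical):

(defmacro define-dialect-alias (short long)
  "Define SHORT as a macro that expands into a call to LONG."
  `(defmacro ,short (&rest args)
     (list* ',long args)))

;; An R-flavored surface over the verbose base:
(define-dialect-alias head  dfhead)
(define-dialect-alias table dfgroupby)
;; Now (head *iris*) expands to (dfhead *iris*).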

Tamas Papp

Nov 2, 2012, 5:20:41 AM
to lisp...@googlegroups.com

On Fri, Nov 02 2012, A.J. Rossini <blind...@gmail.com> wrote:

> Sorry for being out for the last few days. Family issues, I'm an
> alone parent, still resolving issues with my late wife's inheritance,
> etc...
>
> At this point, I'd rather see code and examples than worry about infrastructure.

Hear, hear!

> For missing data -- the same. We need extensions of the numbers to
> include infinity, missingness-categories, as well as nominal and
> ordinal categorical variable storage structures.

I would just use NIL to denote missing data, but maybe you guys have
other requirements.

> 3. convince Tamas to release his graphics package and see if Mirko's
> grammar of graphics is in the right direction, and see if they can be
> pieced together as an example (okay, this is huge)

You don't need to convince me, I am working on it right now. I just
want it to be a bit more polished before releasing it; I expect that the
API will keep evolving, though, with backward-incompatible changes.

Best,

Tamas

A.J. Rossini

Nov 2, 2012, 5:41:45 AM
to lisp...@googlegroups.com
On Fri, Nov 2, 2012 at 10:20 AM, Tamas Papp <tkp...@gmail.com> wrote:


lots of good and exciting things, with only a minor caveat


> On Fri, Nov 02 2012, A.J. Rossini <blind...@gmail.com> wrote:

>> For missing data -- the same. We need extensions of the numbers to
>> include infinity, missingness-categories, as well as nominal and
>> ordinal categorical variable storage structures.
>
> I would just use NIL to denote missing data, but maybe you guys have
> other requirements.

I do -- my thesis eons ago was on interval censoring, and I really do
want missing data (including censored and coarsened data) to be
first-class objects and not afterthoughts. So this would be an extension
to a data class, not just "nil", since there are different levels of
knowledge that are possible, and it would be critical to ensure that
we have appropriate metadata to provide hints as to the conformance of
data-generating processes to data-analysis procedures (and such hints
can be strictly followed, i.e. "you've broken key assumptions, stop it
this instant", or weakly followed -- "yes, I'm modeling binary data
with linear regression, but it might be reasonable").

That being said, using the "Git-R-done" principle, we can move forward
with nil... As Tamas points out, it's a first pass, just not a final
solution.
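
A hypothetical sketch of missingness as first-class objects rather than
nil; the class and accessor names here are invented for illustration:

(defclass missing-value () ()
  (:documentation "Root of a missingness hierarchy."))

(defclass interval-censored (missing-value)
  ((lower :initarg :lower :reader lower-bound)
   (upper :initarg :upper :reader upper-bound))
  (:documentation "Known only to lie between LOWER and UPPER."))

(defgeneric missing-p (x)
  (:method (x) (declare (ignore x)) nil)
  (:method ((x missing-value)) t))

The censoring metadata then travels with the value, so procedures can
check their assumptions against it.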

Tamas Papp

Nov 2, 2012, 11:24:32 AM
to lisp...@googlegroups.com

On Fri, Nov 02 2012, A.J. Rossini <blind...@gmail.com> wrote:

> I do -- my thesis eons ago was on interval censoring, and I really do
> want missing data (including censored and coarsened data) to be
> first-class objects and not afterthoughts. So this would be an extension
> to a data class, not just "nil", since there are different levels of
> knowledge that are possible, and it would be critical to ensure that
> we have appropriate metadata to provide hints as to the conformance of
> data-generating processes to data-analysis procedures (and such hints
> can be strictly followed, i.e. "you've broken key assumptions, stop it
> this instant", or weakly followed -- "yes, I'm modeling binary data
> with linear regression, but it might be reasonable").

Perhaps my perspective is biased because I only do two kinds of
statistics: preliminary checks using simple moments, and full-fledged
Bayesian analysis. For the first, I usually just ignore missing data
(eg when calculating moments, pretend that observation is not there),
and for the second, the reason for missing data (censoring? truncation?
etc) belongs in the model, and I need to deal with that on a
case-by-case basis for each model & dataset.

Maybe we could include a generic function

(defgeneric missing-data? (data)
  (:method (data)
    nil)
  (:method ((data null))
    t))

so that functions (eg mean, sd, variance, quantiles) could ignore data
for which missing-data? returns T, and the user could define

(defmethod missing-data? ((data missing-because-the-dog-ate-it))
  t)

to introduce new kinds of missing data.

Best,

Tamas

A.J. Rossini

Nov 2, 2012, 1:03:34 PM
to lisp...@googlegroups.com
It's a start, and we can start from that (and the following can be
down-graded to manage such a DSL approach), but a medium-quality
implementation would be closer to this: I'd rather have some form of
localizable parameter *missing-data-management* that selects among:

1. throw-exception-because-we-want-to-experiment-with-assumptions

2. ignore/drop-missing-data (with "by-variable, by-observation,
by...." options)

3. forbid-missing-data (and so throw an exception)

4. impute-missing-data (with an "imputation model" which would describe
how to replace: by mean of observed, by draw from the empirical
distribution, by draw from a model-based distribution...)

and this would then suggest how to dispatch the algorithm (with
exception handling to be able to drop in what might be missing). Of
course you'd need a way to detect missingness (and there are different
ways to do that: on a variable, on an observation, on a single value),
and ideally a representation as well (SAS uses ".", R uses "NA"; I'd
think of gensym-like symbols or something like that, which might be
identifiable across instances and implementations, ideally).
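
A minimal sketch of the localizable-parameter idea, using nil as the
stand-in for a missing value per the earlier discussion (all names here
are hypothetical, and only three of the four policies are stubbed out):

(defvar *missing-data-management* :forbid
  "One of :forbid, :drop, or :impute-mean.")

(defun handle-missing (values)
  "Apply the current missing-data policy to a list of VALUES."
  (ecase *missing-data-management*
    (:forbid
     (if (member nil values)
         (error "Missing data not permitted here.")
         values))
    (:drop
     (remove nil values))
    (:impute-mean
     ;; Replace nil with the mean of the observed values;
     ;; assumes at least one value is present.
     (let* ((present (remove nil values))
            (mean (/ (reduce #'+ present) (length present))))
       (substitute mean nil values)))))

;; Being a special variable, the policy is localizable with LET:
;; (let ((*missing-data-management* :drop))
;;   (handle-missing '(1 2 nil 4)))  ; => (1 2 4)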