# Factors and Missing Values


### Steve Nunez

May 12, 2021, 11:43:09 PM
I thought I'd start a new thread since the subject is different. I've been thinking about how best to implement factors and missing values. Missing values, I think, are relatively easy. There was an earlier discussion on this (in a thread named 'Julia'), and I've got it mostly done, using nil as the symbol for missing values. There is room for improvement, but it is enough for a start.

For factors, I looked at the factor implementation in Rho and noticed that a sequence is used to enumerate the possible levels. My thought was to use symbols for factors, with each symbol's value indicating ordering or type; simple and easy to implement. That won't enumerate all possible values, but I can't see any use case for knowing the possible values except error checking. If you are reading data from an external source, you probably won't know all the possibilities unless the data source happens to provide them. Rho also has a bit vector for 'used levels'; any idea what use case might be behind that?
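To make the "sequence of levels plus used-levels flags" idea concrete, here is a minimal Python sketch. It is not Rho's actual layout, which I have not reproduced here; the `Factor` class and its `from_values`, `used_levels`, and `values` methods are hypothetical names chosen for illustration, and `None` stands in for nil as the missing-value marker.

```python
from dataclasses import dataclass


@dataclass
class Factor:
    """Hypothetical factor: a tuple of known levels plus integer codes."""
    levels: tuple  # every level seen so far, in first-seen order
    codes: list    # index into levels per observation; None marks a missing value

    @classmethod
    def from_values(cls, values):
        """Build a factor from raw values, discovering levels as they appear."""
        levels = []
        codes = []
        for v in values:
            if v is None:            # missing value: no level is assigned
                codes.append(None)
                continue
            if v not in levels:      # first time this level is seen
                levels.append(v)
            codes.append(levels.index(v))
        return cls(tuple(levels), codes)

    def used_levels(self):
        """One flag per level: does any observation currently use it?"""
        present = {c for c in self.codes if c is not None}
        return [i in present for i in range(len(self.levels))]

    def values(self):
        """Decode the integer codes back into level values."""
        return [None if c is None else self.levels[c] for c in self.codes]


sex = Factor.from_values(["male", "female", None, "male"])
print(sex.levels)         # the levels discovered while reading the data
print(sex.used_levels())  # one boolean per level
```

Under this sketch the level sequence is discovered while reading the data rather than known up front, which matches the external-data case described above.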

### Tony Rossini

May 13, 2021, 2:52:34 AM
to lisp-stat
Subsetting. If you don't know what is possible when you subset (for example, to look for subgroups with a specific property or optimal behavior), you don't know what has been left out for related covariates that were subsetted away in a multivariate exploration.

### Steve Nunez

May 13, 2021, 3:03:22 AM
Okay, makes sense. How do you determine all possible values of factors a priori? Or is the intention of those sequences in the Rho factor struct to act as a sort of high-water mark of all the factor values seen so far?

### Tony Rossini

May 13, 2021, 3:06:21 AM
to lisp-stat
Yes, the latter. You keep them and sometimes add to them.

For example, you start with a dataset of men, and then extend it by merging with a set of women.

Or you subset a dataset and find only men in the subset, but later (with only the subsetted dataset) wonder if women were ever considered.
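A rough Python sketch of both cases, assuming a factor is stored as a tuple of levels plus a list of integer codes indexing into it. The functions `subset`, `merge`, and `used` are hypothetical helpers for illustration only, and missing values are ignored to keep it short.

```python
def subset(levels, codes, keep):
    """Keep the rows where keep is true; the level set is left untouched."""
    return levels, [c for c, k in zip(codes, keep) if k]


def merge(levels_a, codes_a, levels_b, codes_b):
    """Concatenate two factors, extending the level set with any new levels."""
    levels = list(levels_a)
    for lv in levels_b:
        if lv not in levels:
            levels.append(lv)
    # Map each of b's old code positions to its position in the merged levels.
    remap = [levels.index(lv) for lv in levels_b]
    codes = list(codes_a) + [remap[c] for c in codes_b]
    return tuple(levels), codes


def used(levels, codes):
    """One flag per level: is the level present in the current rows?"""
    present = set(codes)
    return [i in present for i in range(len(levels))]


# Subsetting: only men remain, but "female" is still a known level.
levels, codes = ("male", "female"), [0, 1, 0, 1]
sub_levels, sub_codes = subset(levels, codes, [c == 0 for c in codes])
print(sub_levels, used(sub_levels, sub_codes))

# Merging: a men-only dataset is extended with a women-only dataset.
merged_levels, merged_codes = merge(("male",), [0, 0], ("female",), [0])
print(merged_levels, merged_codes)
```

After the subset, the level tuple still records that women were considered, while the used-levels flags show that none remain in the current rows.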