too much R imitation


josef...@gmail.com

Jun 29, 2016, 4:16:27 AM
to pystatsmodels
https://github.com/statsmodels/statsmodels/issues/3087

Douglas Bates is still arguing for making `1 +` required in Julia's formulas.



Nathaniel Smith

Jun 29, 2016, 5:00:15 AM
to pystatsmodels
Sorry, what's the problem?
--
Nathaniel J. Smith -- https://vorpus.org

josef...@gmail.com

Jun 29, 2016, 7:50:48 AM
to pystatsmodels
On Wed, Jun 29, 2016 at 5:00 AM, Nathaniel Smith <n...@vorpus.org> wrote:
Sorry, what's the problem?

Maybe it's just early-morning grumpiness at finding the first real bug of 0.8.0.rc1.

patsy defaults to dropping missing values, while statsmodels prefers "propagate" or "ignore".
(And AFAICS, there is still no simple string option in dmatrix's nan/missing handling for "propagate" or "ignore".)
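For reference, the closest thing patsy offers today is its NAAction class rather than a string option: passing NA_types=[] tells patsy to treat nothing as missing, so NaNs flow straight through into the design matrix, which approximates "propagate" for numeric data. A minimal sketch:

```python
import numpy as np
import pandas as pd
from patsy import NAAction, dmatrix

data = pd.DataFrame({"x": [1.1, np.nan, 1.2]})

# Default behavior: the NaN row is silently dropped.
dropped = dmatrix("~ x", data, return_type="dataframe")

# Approximate "propagate": declare that nothing counts as missing,
# so the NaN survives into the design matrix unchanged.
keep = dmatrix("~ x", data,
               NA_action=NAAction(NA_types=[]),
               return_type="dataframe")
```

There is no real equivalent for "ignore"; NA_types=[] only disables patsy's detection, it does not add any downstream handling.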

In this case, we keep the index of the dataframe provided by the user, but patsy returns the array with missing rows removed. (The "missing" rows are missing, and pandas complains when building the return Series.)
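The failure mode is easy to reproduce with pandas alone (illustrative values, not the actual statsmodels code): the user's index has the original length, while the array coming back from patsy is shorter, so the Series constructor raises.

```python
import numpy as np
import pandas as pd

# patsy drops the NaN row, while statsmodels keeps the full user index,
# so constructing the return Series raises a length mismatch.
user_index = pd.RangeIndex(3)      # index of the user's 3-row DataFrame
result = np.array([0.5, 0.7])      # model output after patsy dropped one row
try:
    pd.Series(result, index=user_index)
    mismatch = ""
except ValueError as exc:
    mismatch = str(exc)            # pandas complains about the length mismatch
```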

I still don't remember to pay attention to missing behavior when reviewing pull requests. Without test coverage for pandas+nan, this slipped through. One of the disadvantages of having many options is that it is almost impossible to provide unit tests for all combinations at an initial merge.

Josef

Nathaniel Smith

Jun 29, 2016, 11:51:58 PM
to pystatsmodels
On Wed, Jun 29, 2016 at 4:50 AM, <josef...@gmail.com> wrote:
>
>
> On Wed, Jun 29, 2016 at 5:00 AM, Nathaniel Smith <n...@vorpus.org> wrote:
>>
>> Sorry, what's the problem?
>
>
> Maybe just early morning grumpiness finding the first real bug of 0.8.0.rc1.
>
> patsy defaults to drop missing, while statsmodels prefers "propagate" or
> "ignore".

When doing model fitting as well, or just when doing prediction?

> (and AFAICS, there is still no simple string option in dmatrix nan/missing
> handling to "propagate" or "ignore".

Unfortunately this gets very messy very fast :-(. I think you can make
it work if you're willing to forever lock yourself into having
hard-coded special-case code implementing pandas's model of always
using nan to represent missing values (even for string-typed data
where it doesn't make much sense, and even if those nans represent
computational errors instead of data that's actually missing which is
simply wrong, etc.), but I'm a bit reluctant...

> In this case, we keep the index of the dataframe provided by the user, but
> patsy returns the array with missing rows removed. (The "missing" rows are
> missing, and pandas complains when building the return Series.)

Note that if patsy is passed a DataFrame for the inputs, and you do
return_type="dataframe", then it automatically propagates the input
index through to the output, so you may not need to worry about this
in the first place...

FWIW, patsy is also careful to handle dropped rows correctly during
this process, e.g. notice the index on the return value here that lets
you automatically match up output rows to input rows despite some of
them being dropped:

In [5]: dmatrix("~ x", pd.DataFrame(dict(x=[1.1, np.nan, 1.2])),
   ...:         return_type="dataframe")
Out[5]:
   Intercept    x
0        1.0  1.1
2        1.0  1.2

I guess one way to avoid hard-coded checks for nans deep inside the
categorical code would be to have an NA_action of "replace_with_NaN",
which just fills in any removed rows with all-NaN, like:

In [5]: dmatrix("~ x", pd.DataFrame(dict(x=[1.1, np.nan, 1.2])),
   ...:         NA_action="replace_with_NaN", return_type="dataframe")
Out[5]:
   Intercept    x
0        1.0  1.1
1        NaN  NaN
2        1.0  1.2
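That "replace_with_NaN" string is only a proposal, but the same effect can already be had outside patsy by reindexing the dropped-rows design matrix back onto the original index; a pandas-only sketch with the design matrix written out by hand:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1.1, np.nan, 1.2]})
# What dmatrix(..., return_type="dataframe") returns after dropping row 1:
design = pd.DataFrame({"Intercept": [1.0, 1.0], "x": [1.1, 1.2]},
                      index=[0, 2])
# Reindexing onto the input index restores dropped rows as all-NaN:
full = design.reindex(data.index)
```

This relies on patsy's preserved index doing the bookkeeping, so no NaN-specific code is needed inside the categorical machinery.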

> I still don't remember to pay attention to missing behavior when reviewing
> pull requests. Without test coverage for pandas+nan, this slipped through.
> One of the disadvantages of having many options is that it is almost
> impossible to provide unit tests for all combinations at an initial merge.

Yeah, that is definitely challenging.

-n

josef...@gmail.com

Jun 30, 2016, 4:21:58 AM
to pystatsmodels
On Wed, Jun 29, 2016 at 11:51 PM, Nathaniel Smith <n...@vorpus.org> wrote:
On Wed, Jun 29, 2016 at 4:50 AM,  <josef...@gmail.com> wrote:
>
>
> On Wed, Jun 29, 2016 at 5:00 AM, Nathaniel Smith <n...@vorpus.org> wrote:
>>
>> Sorry, what's the problem?
>
>
> Maybe just early morning grumpiness finding the first real bug of 0.8.0.rc1.
>
> patsy defaults to drop missing, while statsmodels prefers "propagate" or
> "ignore".

When doing model fitting as well, or just when doing prediction?

We currently turn off patsy's nan/missing checking in at least some of the model code, but IIRC not by default.
In tsa we should turn it off completely, because dropping is essentially never the right answer, but I think we don't support formulas there yet. Some statespace models handle nans internally in the statistically correct way.



> (and AFAICS, there is still no simple string option in dmatrix nan/missing
> handling to "propagate" or "ignore".

Unfortunately this gets very messy very fast :-(. I think you can make
it work if you're willing to forever lock yourself into having
hard-coded special-case code implementing pandas's model of always
using nan to represent missing values (even for string-typed data
where it doesn't make much sense, and even if those nans represent
computational errors instead of data that's actually missing which is
simply wrong, etc.), but I'm a bit reluctant...

I worry about other missing values when they show up in any numpy or pandas implementation.
My guess is that we will use more masks, like numpy masked arrays, for special cases.

The return from dmatrix that we want is always at least float, so nans are fine as an indicator for observations that are not available.
I was thinking that this should apply on a per-term basis, e.g. if a categorical is nan, then the row has nans in all columns of that term, and consequently in all interactions with it.
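A pandas-only sketch of that per-term idea (the get_dummies encoding and column names are illustrative, not patsy's actual code path): every dummy column generated from a categorical becomes NaN in rows where the categorical itself is missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"c": ["a", None, "b"], "x": [1.0, 2.0, 3.0]})

# Dummy-encode the categorical term; a missing category comes out
# as an all-zeros row, which silently looks like valid data.
dummies = pd.get_dummies(df["c"], prefix="c").astype(float)

# Per-term propagation: overwrite every column of the term with NaN
# wherever the underlying categorical is missing.
dummies.loc[df["c"].isna(), :] = np.nan

design = pd.concat([dummies, df[["x"]]], axis=1)
```

The same masking would then have to be applied to any interaction columns built from the term, since they inherit its missingness.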


> I still don't remember to pay attention to missing behavior when reviewing
> pull requests. Without test coverage for pandas+nan, this slipped through.
> One of the disadvantages of having many options is that it is almost
> impossible to provide unit tests for all combinations at an initial merge.

Yeah, that is definitely challenging.

The other problem for me is that missing-value and formula handling was all Skipper's work, and I avoided getting much into the details.
(And as I'm getting older I'm getting less patient, and want to concentrate on the statistics and econometrics that are still missing.)

Josef