On Wed, Jun 29, 2016 at 4:50 AM, <
josef...@gmail.com> wrote:
>
>
> On Wed, Jun 29, 2016 at 5:00 AM, Nathaniel Smith <
n...@vorpus.org> wrote:
>>
>> Sorry, what's the problem?
>
>
> Maybe just early morning grumpiness finding the first real bug of 0.8.0.rc1.
>
> patsy defaults to drop missing, while statsmodels prefers "propagate" or
> "ignore".
When doing model fitting as well, or just when doing prediction?
> (And AFAICS, there is still no simple string option in dmatrix's
> nan/missing handling for "propagate" or "ignore".)
Unfortunately this gets very messy very fast :-(. I think you can make
it work, but only if you're willing to forever lock yourself into
hard-coded special-case code implementing pandas's model of always
using nan to represent missing values (even for string-typed data,
where it doesn't make much sense, and even when those nans represent
computational errors rather than data that's actually missing, which
is simply wrong, etc.), and I'm a bit reluctant to do that...
> In this case, we keep the index of the dataframe provided by the user, but
> patsy returns the array with the missing rows removed. (The dropped rows
> are then missing from the array, and pandas complains when building the
> return Series.)
Note that if patsy is passed a DataFrame for the inputs, and you do
return_type="dataframe", then it automatically propagates the input
index through to the output, so you may not need to worry about this
in the first place...
FWIW, patsy is also careful to handle dropped rows correctly during
this process, e.g. notice the index on the return value here that lets
you automatically match up output rows to input rows despite some of
them being dropped:
In [5]: dmatrix("~ x", pd.DataFrame(dict(x=[1.1, np.nan, 1.2])),
   ...:         return_type="dataframe")
Out[5]:
   Intercept    x
0        1.0  1.1
2        1.0  1.2
I guess one way to avoid hard-coded checks for nans deep inside the
categorical code would be to have an NA_action of "replace_with_NaN",
which just fills in any removed rows with all-NaN, like:
In [6]: dmatrix("~ x", pd.DataFrame(dict(x=[1.1, np.nan, 1.2])),
   ...:         NA_action="replace_with_NaN", return_type="dataframe")
Out[6]:
   Intercept    x
0        1.0  1.1
1        NaN  NaN
2        1.0  1.2
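(Until such an NA_action exists, roughly the same effect is available today by reindexing the dropped-rows result back to the full input index. A plain-pandas sketch, where the literal frame stands in for what dmatrix returns with return_type="dataframe":)

```python
import numpy as np
import pandas as pd

# Stand-in for today's dmatrix output: the nan row (label 1) has been
# dropped, but the surviving labels from the input index are preserved.
mat = pd.DataFrame({"Intercept": [1.0, 1.0], "x": [1.1, 1.2]},
                   index=[0, 2])

# Emulate the proposed "replace_with_NaN": reindex back to the full
# input index, so the removed row reappears as all-NaN.
full = mat.reindex(range(3))
print(full)
```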
> I still don't remember to pay attention to missing behavior when reviewing
> pull requests. Without test coverage for pandas+nan, this slipped through.
> One of the disadvantages of having many options is that it is almost
> impossible to provide unit tests for all combinations at an initial merge.
Yeah, that is definitely challenging.
-n