segfault with pandas timeindex and ols ?

josef...@gmail.com

Dec 9, 2012, 8:54:46 AM
to pystatsmodels
http://stackoverflow.com/questions/13786209/regression-on-stock-data-using-pandas-and-matplotlib

from the answer
"The segfault comes from trying to use the Datetime index as the
exogenous variable"

Can someone replicate the segfault, and see where it is caused?

Thanks,

Josef

Skipper Seabold

Dec 10, 2012, 9:55:18 AM
to pystat...@googlegroups.com
I cannot replicate it with pandas 0.9.1, with or without --pylab.

Skipper

Wes McKinney

Dec 10, 2012, 3:53:31 PM
to pystat...@googlegroups.com
On Mon, Dec 10, 2012 at 3:14 PM, Tom Augspurger
<tom.augs...@gmail.com> wrote:
> I was the one who posted that answer. I'm able to reproduce it each time I
> try to run his code. As I suggested on SO, when the index is not the
> Datetime index he was using, the crash is avoided.
>
> I'd be happy to help you debug it, but I'd be more of a test subject. I've
> got no idea why it's occurring. To start things off, I'm running Mac OS
> 10.8.2, python 2.7.
>
> matplotlib version: 1.2.x
> sm version: 0.5.0.dev-bdf0c45
> pandas version: 0.9.1.dev-f391180
> I think I'm on version 0.13.1 of IPython.
>
> Let me know how else I can help. I've uploaded the crash
> report here: https://gist.github.com/ade907a67b65c0df5ddc

What NumPy version?

Tom Augspurger

Dec 10, 2012, 4:47:44 PM
to pystat...@googlegroups.com
Sorry, missed that one:
In [62]: np.version.full_version
Out[62]: '1.6.2'

josef...@gmail.com

Dec 10, 2012, 5:13:11 PM
to pystat...@googlegroups.com
A general question (since I can't find a Python install with the matching versions):

What happens with np.asarray with other dtypes in numpy?

My guess is that, to be more robust, we should convert with asarray to a
float dtype before calling linalg functions.
So far I had assumed that the linalg functions already have enough checks built in.

I don't know whether there are cases where casting to float is a bad idea:
numeric dtypes with less than float precision,
complex, automatic differentiation types?
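
To make the question concrete, here is a minimal sketch of what coercing with asarray and a float dtype does for a few dtypes. The helper name to_float_array is made up for illustration; it is not a statsmodels function.

```python
import numpy as np


def to_float_array(x):
    """Hypothetical helper: coerce array-like input to a float64 ndarray."""
    return np.asarray(x, dtype=float)


# Integers and lower-precision floats are upcast to float64.
from_int = to_float_array([1, 2, 3])
from_f32 = to_float_array(np.array([0.5, 1.5], dtype=np.float32))
print(from_int.dtype, from_f32.dtype)

# Complex is the awkward case: numpy does not treat complex -> float as a
# safe cast, because the imaginary part would have to be dropped.
print(np.can_cast(np.complex128, np.float64))
```

The can_cast call only reports what numpy considers a safe cast; whether silently dropping information is acceptable for a given model is a separate decision.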

Josef

Tom Augspurger

Dec 10, 2012, 5:34:42 PM
to pystat...@googlegroups.com
Internet went out here (this is on my phone) so I'll post more later.

Basically, I've reproduced the crash when using pandas date_range and DatetimeIndex objects as exogenous variables in OLS. The crash doesn't occur when I call .fit(), but it does when I do something with the results, like printing them.

What other dtypes should I check?

josef...@gmail.com

Dec 10, 2012, 5:56:33 PM
to pystat...@googlegroups.com
I can now reproduce it as well, on Windows in a virtualenv with numpy 1.6.2 and
pandas.__version__ = '0.9.0'

>>> res.model.exog.dtype
dtype('datetime64[ns]')
>>> res._results.params
array([ 7.28078437e-17])

res.params raises an exception.

Accessing fittedvalues segfaults.

Josef

josef...@gmail.com

Dec 10, 2012, 6:43:47 PM
to pystat...@googlegroups.com
This is a numpy bug in 1.6:
>>> np.dot(res.model.exog, [1.]).shape

segfaults.
dot doesn't seem to like dtype('datetime64[ns]').

My guess is that we need to explicitly check the dtype of endog and
exog and raise an exception in this case.
Conversion to float doesn't seem to make sense in this example (?):
>>> np.asarray(res.model.exog, float)[:5]
array([[ 1.06479360e+18],
       [ 1.06488000e+18],
       [ 1.06496640e+18],
       [ 1.06505280e+18],
       [ 1.06513920e+18]])
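
For reference, the large numbers in that output are nanoseconds since the Unix epoch. A minimal sketch with plain numpy, meant for a recent numpy rather than 1.6: going through int64 explicitly is my choice here, and using elapsed days as a regressor is only one possible alternative, not something proposed in this thread.

```python
import numpy as np

# First two dates of the series in the example (2003-09-29 and the next day).
dates = np.array(["2003-09-29", "2003-09-30"], dtype="datetime64[ns]")

# datetime64[ns] values are stored as int64 nanoseconds since 1970-01-01.
as_float = dates.astype("int64").astype(float)
print(as_float)

# Assumption: elapsed days since the first observation as a regressor.
elapsed_days = (dates - dates[0]) / np.timedelta64(1, "D")
print(elapsed_days)
```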


>> What other dtypes should I check?

I don't know if anyone ever tried an object array.

Josef

josef...@gmail.com

Dec 10, 2012, 6:53:05 PM
to pystat...@googlegroups.com
Just another check: pandas is not to blame either.

>>> sp500.index[:10]
<class 'pandas.tseries.index.DatetimeIndex'>
[2003-09-29 00:00:00, ..., 2003-10-10 00:00:00]
Length: 10, Freq: None, Timezone: None
>>> a = np.asarray(sp500.index)
>>> a[:10]
array([1970-01-13 96:00:00, 1970-01-13 120:00:00, 1970-01-13 144:00:00,
1970-01-13 168:00:00, 1970-01-13 192:00:00, 1970-01-13 08:00:00,
1970-01-13 32:00:00, 1970-01-13 56:00:00, 1970-01-13 80:00:00,
1970-01-13 104:00:00], dtype=datetime64[ns])
>>> np.dot(a, [1])
<booom>

Josef

Michael Aye

Dec 12, 2012, 9:06:01 PM
to pystat...@googlegroups.com
But the datetime64 is an exclusive pandas object-type, isn't it? Maybe Wes does some trickery that np.dot doesn't understand / like?

Michael 

josef...@gmail.com

Dec 12, 2012, 9:41:33 PM
to pystat...@googlegroups.com
No, it's a numpy dtype that was introduced in numpy 1.6 and has some
bugs in that version.
I was able to cause the segfault with numpy alone, with no pandas or
statsmodels involved.

The bug in numpy is fixed in the current beta for the numpy 1.7.0 release.

I had moved the question to the numpy mailing list, and forgot to add
the conclusions here.

Josef

josef...@gmail.com

Dec 12, 2012, 9:50:54 PM
to pystat...@googlegroups.com
And a thank you to Tom, bmu and Ben.

Once I knew which numpy version was involved and that it doesn't
crash in fit(), it was quite easy to track down.

Josef

Skipper Seabold

Dec 13, 2012, 8:19:06 AM
to pystat...@googlegroups.com
And to you for following up and getting to the bottom of this. If at all possible, I think we should be directing people away from 1.6.x if their use cases involve datetime. Wes went to great lengths to ensure that the "right thing" is done in pandas even though datetime was not polished in 1.6.x, but it's mighty confusing for users.

I'm not entirely sure what we should do here yet. Is there a github issue for consistent/configurable type handling?

Skipper

josef...@gmail.com

Dec 13, 2012, 8:52:49 AM
to pystat...@googlegroups.com
The problem is that 1.7.0 is not released yet, so 1.6.2 is still the
latest release.
Point users to the 1.7.0 beta if they are relying heavily on datetime?
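
A minimal sketch of how a script could detect the affected series, assuming (as reported in this thread) that the segfault is specific to numpy 1.6.x and fixed in the 1.7.0 beta. The helper name is_numpy_1_6 is made up for illustration.

```python
import numpy as np


def is_numpy_1_6(version=None):
    """Hypothetical helper: True if the version string is a 1.6.x release."""
    if version is None:
        version = np.__version__
    # Only the first two components are compared, so suffixes such as
    # "b2" or ".dev-..." in the third component do not matter.
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) == (1, 6)


if is_numpy_1_6():
    print("numpy 1.6.x detected: np.dot on datetime64 arrays may segfault")
```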

>
> I'm not entirely sure what we should do here yet. Is there a github
> issue for consistent/configurable type handling?

As far as I remember, we never worried about dtypes before; there is
only the current issue
https://github.com/statsmodels/statsmodels/issues/586

For now I would add an explicit dtype check for datetime in endog and
exog to the data handling, and raise a ValueError. I guess the same
goes for dtype=object.
For other dtypes (float32, complex), I would wait for problems to show
up: either let it raise whichever exception gets raised when it doesn't
work, or let it just work.
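
A minimal sketch of such a check, under the assumption that it runs on the raw endog/exog before any linear algebra. The function name check_dtype and the error messages are hypothetical; this is not the statsmodels implementation.

```python
import numpy as np


def check_dtype(data, name):
    """Hypothetical check: reject datetime64 and object arrays."""
    arr = np.asarray(data)
    if np.issubdtype(arr.dtype, np.datetime64):
        raise ValueError("%s has a datetime dtype (%s)" % (name, arr.dtype))
    if arr.dtype == object:
        raise ValueError("%s has dtype=object" % name)
    # Other dtypes (float32, complex, ...) are passed through unchanged.
    return arr


exog = np.array(["2003-09-29", "2003-09-30"], dtype="datetime64[ns]")
try:
    check_dtype(exog, "exog")
except ValueError as err:
    print(err)
```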

(I'm curious about complex linear algebra, but it needs a recent scipy
for svdvals, and I haven't tried yet.)

Josef

