Re: [pydata] Covariance matrix not positive semi-definite


josef...@gmail.com

unread,
Apr 30, 2013, 1:57:16 PM4/30/13
to pyd...@googlegroups.com
On Tue, Apr 30, 2013 at 1:21 PM, Paul Blelloch <paul.b...@gmail.com> wrote:
> I'm relatively new to pandas and thought I'd start by processing some
> stock data as an example. I read a bunch of stock data into a Panel
> using the get_data_yahoo function, then calculated and stored the
> returns in a DataFrame. I then used the .cov() method to calculate the
> covariance. What I found was that although the covariance matrix was
> symmetric, it wasn't positive semi-definite. This wreaked havoc with
> some optimization routines that I was running. Should the covariance be
> positive semi-definite? Is there any way to modify the calculation so
> that it is? Here's a relevant snippet of code:
>
> from pandas.io.data import *
> from pandas import *
> from numpy.linalg import eig
>
> print 'Reading stock data from Yahoo Finance'
>
> # List of all stock symbols to download
> symbols = ['TRBCX', 'CMTFX', 'TREMX', 'PRFDX', 'PEXMX', 'PRITX', 'PRLAX',
>            'RPMGX', 'TRMCX', 'PRASX', 'PRNHX', 'OPGSX', 'TRREX', 'PRSCX',
>            'PRSVX', 'PRSGX', 'PSILX', 'PRHYX', 'PTTRX', 'RPSIX', 'PRTIX',
>            'TRRFX', 'TRRAX', 'TRRGX', 'TRRBX', 'TRRHX', 'TRRCX', 'TRRJX',
>            'TRRDX', 'TRRKX', 'TRRMX', 'TRRNX', 'TRRIX']
>
> # Download data from Yahoo as a pandas Panel object
> stock_data = get_data_yahoo(symbols, start='1/1/1900')
>
> # Pull out adjusted closing prices as a pandas DataFrame
> adj_close = stock_data['Adj Close']
>
> # Calculate simple returns
> returns = adj_close/adj_close.shift(1) - 1
>
> # Covariance matrix
> covariance = returns.cov().values
>
> # Check that all eigenvalues are positive
> e = eig(covariance)
> print e[0]  # Print eigenvalues


As far as I know, if you have missing values, and do only pairwise
deletion (instead of listwise), then positive semi-definiteness is not
guaranteed.

https://github.com/statsmodels/statsmodels/pull/631

It's not directly the algorithm of the reference in
https://github.com/statsmodels/statsmodels/issues/303
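
A minimal sketch of how this can happen, using hypothetical toy data (modern Python 3 syntax, not the original Yahoo download). Each pair of columns overlaps on a different set of rows, so pairwise deletion estimates each covariance entry from a different subsample, and the assembled matrix need not be PSD:

```python
import numpy as np
import pandas as pd

# Toy data: every pair of columns is jointly observed on different rows
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, np.nan, 1.0, 2.0],
    "y": [1.0, 2.0, 1.0, 2.0, np.nan, np.nan],
    "z": [np.nan, np.nan, 1.0, 2.0, 2.0, 1.0],
})

# pandas' .cov() drops NaNs pair by pair
pairwise_cov = df.cov().values

# The smallest eigenvalue comes out negative: not a valid covariance matrix
min_eig = np.linalg.eigvalsh(pairwise_cov).min()
print(min_eig)
```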

Josef


>
> --
> You received this message because you are subscribed to the Google Groups
> "PyData" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to pydata+un...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Skipper Seabold

unread,
Apr 30, 2013, 1:56:43 PM4/30/13
to pyd...@googlegroups.com
My guess is pandas' NA-behavior here is too clever by half and does
pair-wise dropping of NAs, which I'd argue is probably wrong. Consider
this.

covariance = returns.dropna().cov().values
e = eig(covariance)
print e[0]

Skipper

Skipper Seabold

unread,
Apr 30, 2013, 2:08:30 PM4/30/13
to pyd...@googlegroups.com
Ah right, then I retract my comment about it being wrong. Maybe a note
in the docstring (with a see also when your work is merged)?

Skipper

Robert Kern

unread,
Apr 30, 2013, 2:48:54 PM4/30/13
to pyd...@googlegroups.com
On Tue, Apr 30, 2013 at 7:08 PM, Skipper Seabold <jsse...@gmail.com> wrote:
> On Tue, Apr 30, 2013 at 1:57 PM, <josef...@gmail.com> wrote:

>> As far as I know, if you have missing values, and do only pairwise
>> deletion (instead of listwise), then positive semi-definiteness is not
>> guaranteed.
>>
>> https://github.com/statsmodels/statsmodels/pull/631
>>
>> It's not directly the algorithm of the reference in
>> https://github.com/statsmodels/statsmodels/issues/303
>>
>
> Ah right, then I retract my comment about it being wrong. Maybe a note
> in the docstring (with a see also when your work is merged)?

Oh, it's still wrong. ;-)

If pairwise deletion does not guarantee positive semi-definiteness,
then that just means that pairwise deletion cannot be used to
implement .cov() without further postprocessing. Personally, I prefer
to keep pairwise deletion and its postprocessing separate from the
default behavior of methods named ".cov()", but given that the
pairwise behavior is documented, I would simply recommend adding some
documentation about the consequences of that behavior, namely that the
resulting matrix is not a covariance matrix and may require extra
processing for many applications.
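
One common flavor of that postprocessing is to project the symmetric matrix onto the PSD cone by clipping negative eigenvalues; the sketch below (a hypothetical helper, not a pandas API) shows the idea:

```python
import numpy as np

def clip_to_psd(a, eps=0.0):
    """Return the nearest PSD matrix in Frobenius norm to the symmetric
    matrix `a`, obtained by clipping eigenvalues below `eps` up to `eps`."""
    vals, vecs = np.linalg.eigh(a)
    # Scale eigenvector columns by the clipped eigenvalues and reassemble
    return (vecs * np.clip(vals, eps, None)) @ vecs.T

# A symmetric matrix with a negative eigenvalue, like a pairwise "covariance"
bad = np.array([[ 1.0,  0.9, -0.9],
                [ 0.9,  1.0,  0.9],
                [-0.9,  0.9,  1.0]])
fixed = clip_to_psd(bad)
```

Whether clipping (as opposed to, say, shrinkage toward a well-conditioned target) is appropriate depends on the application, which is part of why it arguably doesn't belong inside .cov() itself.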

--
Robert Kern

Skipper Seabold

unread,
Apr 30, 2013, 2:58:29 PM4/30/13
to pyd...@googlegroups.com
On Tue, Apr 30, 2013 at 2:48 PM, Robert Kern <rober...@gmail.com> wrote:
> On Tue, Apr 30, 2013 at 7:08 PM, Skipper Seabold <jsse...@gmail.com> wrote:
>> On Tue, Apr 30, 2013 at 1:57 PM, <josef...@gmail.com> wrote:
>
>>> As far as I know, if you have missing values, and do only pairwise
>>> deletion (instead of listwise), then positive semi-definiteness is not
>>> guaranteed.
>>>
>>> https://github.com/statsmodels/statsmodels/pull/631
>>>
>>> It's not directly the algorithm of the reference in
>>> https://github.com/statsmodels/statsmodels/issues/303
>>>
>>
>> Ah right, then I retract my comment about it being wrong. Maybe a note
>> in the docstring (with a see also when your work is merged)?
>
> Oh, it's still wrong. ;-)

Well, let me say then that my drop-all-observations covariance isn't
necessarily right either, since that reference points out that it's
inconsistent.

>
> If pairwise deletion does not guarantee positive semi-definiteness,
> then that just means that pairwise deletion cannot be used to
> implement .cov() without further postprocessing. Personally, I prefer
> to keep pairwise deletion and its postprocessing separate from the
> default behavior of methods named ".cov()", but given that the
> pairwise behavior is documented, I would simply recommend adding some
> documentation about the consequences of that behavior, namely that the
> resulting matrix is not a covariance matrix and may require extra
> processing for many applications.
>

I tend to agree with the keeping 'advanced' estimators separate from
"cov" since there are likely different ways to handle this (I don't
know the literature), and users should be forced to think in these
cases what they really want.

The documentation on NA-handling also isn't crystal clear (to me).

Skipper

Paul Blelloch

unread,
Apr 30, 2013, 3:32:07 PM4/30/13
to pyd...@googlegroups.com
Thank you for all your replies. It appears that I generated a non-trivial discussion.
 
I'm happy to just drop data from any row where any of my columns is missing data (I'll still have plenty of data to do meaningful statistics). Then everything should be consistent, and presumably I'd get a positive semi-definite covariance. Is there a simple method in pandas to slice out just those rows that have data in all columns?

Skipper Seabold

unread,
Apr 30, 2013, 3:40:03 PM4/30/13
to pyd...@googlegroups.com
On Tue, Apr 30, 2013 at 3:32 PM, Paul Blelloch <paul.b...@gmail.com> wrote:
> Is there a simple method in pandas to slice out just those rows that have data in
> all columns?

df = df.dropna()

or

idx = df.dropna().index
df.ix[idx]

should work
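
For what it's worth, the covariance of listwise-complete data is a Gram matrix of the centered observations, so it is PSD by construction. A quick sanity check on made-up data (random numbers here, not the fund returns):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
returns = pd.DataFrame(rng.randn(100, 4), columns=list("abcd"))
returns.iloc[::7, 0] = np.nan  # sprinkle in some missing values

# Listwise deletion, then covariance
cov = returns.dropna().cov().values

# Smallest eigenvalue is non-negative (up to floating-point rounding)
smallest = np.linalg.eigvalsh(cov).min()
print(smallest)
```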

Skipper

Jeff

unread,
Apr 30, 2013, 3:45:17 PM4/30/13
to pyd...@googlegroups.com
FYI there is also a min_periods argument to cov that requires a minimum number of observations
per pair of columns (jointly non-NaN values of the two); entries with fewer observations come back as NaN
 
see
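
A small illustration of min_periods on hypothetical data: here columns a and b are jointly non-NaN in only 2 rows, so requiring 3 observations per pair makes that entry NaN while the diagonal (each column against itself, 3 observations) is still computed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [np.nan, 1.0, 2.0, 3.0]})

# Only 2 rows have both a and b; min_periods=3 turns cov(a, b) into NaN
c = df.cov(min_periods=3)
print(c)
```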

Paul Blelloch

unread,
Apr 30, 2013, 4:00:08 PM4/30/13
to pyd...@googlegroups.com
THANKS!  That worked like a charm.  I do think that pandas is fantastic.