[SciPy-User] stats, classes instead of functions for results MovStats

josef...@gmail.com

unread,

Nov 23, 2009, 12:43:57 AM11/23/09

to SciPy Users List

Following up on a question by Keith on the numpy list and his reminder
that covariance can be calculated by the cross-product minus the
product of the means, I redid and
enhanced my moving stats functions.

Suppose x and y are two time series, then the moving correlation
requires the calculation of the mean, variance and covariance for each
window. Currently in scipy stats intermediate results are usually
thrown away on return (while rpy/R returns all intermediate results
used for the calculation.

Using a decorator/descriptor of Fernando written for nitime, I tried
out to write the function as a class instead, so that any desired (
intermediate) calculations are only made on demand, but once they are
calculated they are attached to the class as attributes or properties.
This seems to be a useful "pattern".

Are there any opinion for using the pattern in scipy.stats ? MovStats
will currently go into statsmodels

Below is the class (with cutting part of init), a full script is the
attachment, including examples that test the class.

about MovStats:
y and x are tested for 2d, either (T,N) with axis=0 or (N,T) with
axis=1, should (but may not yet) work for nd arrays along any axis
(signal.correlate docstring)
nans are handled by dropping the corresponding observations from the
window, not adding any additional observations,
not tested if a window is empty because it contains only nans, nor if
variance is zero
(kern is intended for weighted statistics in the window but not tested
yet, I still need to decide on normalization requirements)
requires scipy.signal, all calculations done with signal.correlate, no loops
as often, functions are one-liners
all results are returned for valid observations only, initial
observations with incomplete window are cut
bonus: slope of moving regression of y on x, since it was trivial to add
still some cleaning and documentation to do

usage:
ms = MovStats(x, y, axis=1)
ms.yvar
ms.xmean
ms.yxcorr
ms.yxcov
...

Josef

class MovStats(object):
def __init__(self, y, x=None, kern=5, axis=0):
self.y = y
self.x = x
if np.isscalar(kern):
ws = kern
<... snip>

@OneTimeProperty
def ymean(self):
ys = signal.correlate(self.y, self.kern, mode='same')[self.sslice]
ym = ys/self.n
return ym

@OneTimeProperty
def yvar(self):
ys2 = signal.correlate(self.y*self.y, self.kern,
mode='same')[self.sslice]
yvar = ys2/self.n - self.ymean**2
return yvar

@OneTimeProperty
def xmean(self):
if self.x is None:
return None
else:
xs = signal.correlate(self.x, self.kern, mode='same')[self.sslice]
xm = xs/self.n
return xm

@OneTimeProperty
def xvar(self):
if self.x is None:
return None
else:
xs2 = signal.correlate(self.x*self.x, self.kern,
mode='same')[self.sslice]
xvar = xs2/self.n - self.xmean**2
return xvar
@OneTimeProperty
def yxcov(self):
xys = signal.correlate(self.x*self.y, self.kern,
mode='same')[self.sslice]
return xys/self.n - self.ymean*self.xmean

@OneTimeProperty
def yxcorr(self):
return self.yxcov/np.sqrt(self.yvar*self.xvar)

@OneTimeProperty
def yxslope(self):
return self.yxcov/self.xvar

movstats.py

Pierre GM

unread,

Nov 23, 2009, 1:39:16 AM11/23/09

to SciPy Users List

Can you add support for MaskedArrays ?
The easiest would be to check whether your inputs are masked arrays. If yes, make sure they're float (transform them if needed) and fill them w/ nans as needed.
You can also check what Matt did w/ scikits.timeseries.
About your suggestion: I'd leave it in statsmodels for now...
_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

josef...@gmail.com

unread,

Nov 23, 2009, 2:13:28 AM11/23/09

to SciPy Users List

Since only __init__ is affected this should be quite easy, I only need
the mask for the calculation of the number of non-nan elements in a
window, and to fill the data array with zeros. I haven't thought about
different numeric types, I guess I should make sure that also for the
non-ma arrays the calculations are done with floats.

> You can also check what Matt did w/ scikits.timeseries.

The way of calculating this, I initially got from scikits.timeseries
autocovariance, your moving_funcs are mostly in c,
cmov_window uses np.convolve which is only for 1d and needs to loop.
The advantage of scipy.signal over numpy is that it does nd
convolution.
I will look at the mask handling in time series again.

I always get mixed up with convolve versus correlate. Is there a
standard sorting for time series, up to down or left to right by
increasing time or reversed? I have to check this for non-flat window
weights/kernels.

> About your suggestion: I'd leave it in statsmodels for now...

movstat goes into statsmodels.sandbox.tsa which is my playground for
time series analysis

for scipy.stats I was thinking more of existing or other functions,
e.g. my version of groupstats, (mean, variance, demean, ... by groups)
would follow the same pattern of partly expensive calculations on
demand.

Josef

Bruce Southey

unread,

Nov 30, 2009, 4:05:03 PM11/30/09

to scipy...@scipy.org


_______________________________________________
SciPy-User mailing list
SciPy...@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-user

I think your handling of NaN's is incorrect because you do not drop the corresponding observations. That is for two arrays

y=np.array([[ 1.229563, -0.339428, 0.83891 , 4.026574, 3.069378, 5.95668 ]])
x=np.array([[-1.236469, 1.941089, -0.346566, -0.268529, np.nan, 0.191336]])

For a windows size of 5, in the first window, the first mean and variance of y should use all 5 elements of y, the mean and variance of X should use the first 4 elements of x and, the regression and correlation coefficients should use the first 4 elements of x and y.

Some other points:
1) Your calculation of variance is susceptible to errors, see
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Provided that you are using sufficient precision (like numpy defaults) it is probably not that big a problem.
2) You only use the 'full' windows so when the window width is 5, you miss the first two windows and the last 2 windows. At least the mean exists in these windows and the variance in most of these partial windows. This may provided unexpected results to a user if they do not release which windows are not returned.
3) I think the user needs to define the kern argument for your MovStat class as there is probably no meaningful default value (except 42).
4) I do not know how you should handle positive and negative infinity.
5) Your code expects at least 2 dimensions so 1-d arrays fail because you can not do this assignment 'kdim[axis] = ws' with 1-d arrays.

Bruce

josef...@gmail.com

unread,

Nov 30, 2009, 4:43:55 PM11/30/09

to SciPy Users List

Thanks for checking

> I think your handling of NaN's is incorrect because you do not drop the
> corresponding observations. That is for two arrays
>
> y=np.array([[ 1.229563, -0.339428, 0.83891 , 4.026574, 3.069378, 5.95668
> ]])
> x=np.array([[-1.236469, 1.941089, -0.346566, -0.268529, np.nan,
> 0.191336]])
>
> For a windows size of 5, in the first window, the first mean and variance of
> y should use all 5 elements of y, the mean and variance of X should use the
> first 4 elements of x and, the regression and correlation coefficients
> should use the first 4 elements of x and y.

What I do currently is a compromise, I don't want to calculate mean
and variance twice. So the behavior now is, if only one array is
given, then you get the mean and variance dropping the nan
observations for that array. If two arrays are given then I drop
observations in both arrays if either one has a nan. This way the user
can choose whether they want the separate calculation.
If a user provides two arrays, my assumption is that they want cov and corr.

>
>
> Some other points:
> 1) Your calculation of variance is susceptible to errors, see
> http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
> Provided that you are using sufficient precision (like numpy defaults) it is
> probably not that big a problem.

Yes, I'm aware of this, for the example the difference to np.corrcoeff
is around
1e-14. I will add a warning to the docstring that this is designed for speed
with some precision loss. I might be able to use some preprocessing to at least
treat some badly scaled data, but the usual higher numerical precision ways
of calculating would require much slower loops. For reasonably short windows
this seems to be an acceptable tradeoff.

> 2) You only use the 'full' windows so when the window width is 5, you miss
> the first two windows and the last 2 windows. At least the mean exists in
> these windows and the variance in most of these partial windows. This may
> provided unexpected results to a user if they do not release which windows
> are not returned.

Currently I'm returning "valid" observations, that have a full window.
In a previous
version I allowed for a lag, lead, centered option, Keith returns nans, I think
scikits timeseries masks. I don't know yet which or whether these options
should be included in this function (class).

> 3) I think the user needs to define the kern argument for your MovStat class
> as there is probably no meaningful default value (except 42).

at least the window length should be specified by the user. I picked 5 for
business week mostly arbitrary to reduce typing (?)
In a slightly updated version I switched to convolution instead of correlation
to have the correct orientation for e.g. exponential weights. But I haven't
tested this yet

> 4) I do not know how you should handle positive and negative infinity.

I haven't thought about this, but since most of the time I consider inf as
a valid number, I think, it will return infs in the corresponding windows.
With a masked array, infs could be masked, but for regular arrays, I don't
want to convert infs to nans.
However, I still have to figure out some corner cases, e.g. no valid
observations
in a window, windows with zero variance. it would also be possible to
require a minimum of valid observations per window.

> 5) Your code expects at least 2 dimensions so 1-d arrays fail because you
> can not do this assignment 'kdim[axis] = ws' with 1-d arrays.

Thanks, I have tested only the 2d case. I guess I have to review
general axis handling again