On Mon, Oct 29, 2012 at 9:35 PM, Richard Styron
<
richard....@gmail.com> wrote:
> Josef, thanks again for the response. Certainly not too long. I've got
> good journal access and if you like I will send you the paper you were
> cited, and another on Peirce's criterion, which allows for multiple outlier
> rejection; it is iterative but should always converge (?) and doesn't rely
> on external inputs for location and scale, although it is designed for
> normal distributions.
If you sent them directly to me, this would be helpful.
I'm a bit skeptical about Peirce and even more Chauvet. They are
"very" old. And I haven't seen much recent (theoretical) discussion
about them.
I don't know what the underlying assumptions are, and I'm doubtful
that they control the rejection rate in the sequential testing.
http://classes.engineering.wustl.edu/2009/fall/che473/handouts/OutlierRejection.pdf
argues in favor of Peirce but without much detail.
I took their example, and get the same two outliers
>>> yy3 = np.array([101.2, 90.0, 99.0, 102.0, 103.0, 100.2, 89.0, 98.1, 101.5, 102.0])
>>> res = sm.RLM(yy3, np.ones(len(yy3)), M=sm.robust.norms.TukeyBiweight()).fit(sm.robust.scale.HuberScale())
>>> res.weights
array([ 0.99849705, 0. , 0.93362271, 0.97840215, 0.92146192,
0.99062103, 0. , 0.85939873, 0.99369045, 0.97840215])
>>> yy3[res.weights < 0.5]
array([ 90., 89.])
>>> res.scale
2.223903327758403
>>> res.params
array([ 100.91433009])
>>> np.sqrt(res.scale)
1.4912757383389577
something like the following for an outlier test based on robust mean
and variance estimation
>>> from statsmodels.stats.multitest import multipletests
sresid is standardized residual, I'm not sure it's the most appropriate
>>> pval = stats.t.sf(np.abs(res.sresid), res.df_resid)*2
>>> pval
array([ 0.90061412, 0.00083865, 0.41169426, 0.637094 , 0.3728094 ,
0.75538837, 0.00045788, 0.23747397, 0.79820926, 0.637094 ])
controls familywise error rate at alpha = 0.05
>>> multipletests(pval)
(array([False, True, False, False, False, False, True, False, False,
False], dtype=bool), array([ 0.99370536, 0.00752262, 0.96182344,
0.99370536, 0.96182344,
0.99370536, 0.00456936, 0.88570233, 0.99370536,
0.99370536]), 0.0051161968918237433, 0.0050000000000000001)
controls false discovery rate at alpha = 0.05
>>> multipletests(pval, method='fdr_bh')
(array([False, True, False, False, False, False, True, False, False,
False], dtype=bool), array([ 0.90061412, 0.00419327, 0.82338852,
0.88689918, 0.82338852,
0.88689918, 0.00419327, 0.7915799 , 0.88689918,
0.88689918]), 0.0051161968918237433, 0.0050000000000000001)
>>> res_mt = multipletests(pval, method='fdr_bh')
>>> yy3[res_mt[0]] #outlier values
array([ 90., 89.])
The advantage of this approach is that we get a robust estimate of
mean and variance. I think in the iterative methods and sequential
testing are more likely to get stuck when several outliers mask each
other.
IIRC, TukeyBiweight can be sensitive to starting values, and we might
have to try a different starting methods or values. (In the
regression setting we could use an MM-estimator, which just puts a
different starting method in front, but I don't know if that is
relevant in the constant mean case.)
(i just wrote this in a python session, I'm pretty sure it works, but
will take more time to verify it's correct.)
Josef