outlier rejection (preferably with Peirce's criterion) with pandas/statsmodels

1,573 views
Skip to first unread message

Richard Styron

unread,
Oct 29, 2012, 1:08:50 PM10/29/12
to pystat...@googlegroups.com
Hi everyone,

I have a dataset consisting of hundreds of geochemical samples with 3-10 replicate analyses of each that are in a Pandas dataframe.  Because of the commonality of impurities in the sample, quite frequently one or more measurements need to be excluded before summary statistics (mean, std) can be used.  I have previously used Chauvenet's and Peirce's criteria for outlier detection and removal, and prefer Peirce's criterion, but certainly others exist.  It looks like statsmodels has outlier detection methods for x,y data (maybe not in production yet?), but what about for repeated measurements of a single quantity?

Thanks,
Richard

josef...@gmail.com

unread,
Oct 29, 2012, 1:33:05 PM10/29/12
to pystat...@googlegroups.com
Much of our robust methods and outlier detection are for linear
regression. statsmodels.robust also has robust estimates for location
and scale for single/univariate variable.

I think what can be used depends on what assumption you make on the
joint structure of your data.

For example,
Are all of your repeated samples different, or is there a common mean
and/or variance or outlier pattern?

If they are all different, then it's just a univariate estimate for
the variance and robust location scale might help but, I guess, will
not be very good with only 3 sample points.

If there is a common variance, then it could be estimated very well (I
guess) from the cross-section and outliers could be identified in a
more reliable way.
one possibility: use RLM to regress on indicator variables for the
different means across samples (essentially fixed effects estimation
in a panel data set.

Can you post just a few samples to get a better idea about what the
data looks like.

Josef



>
> Thanks,
> Richard

Richard Styron

unread,
Oct 29, 2012, 2:45:20 PM10/29/12
to pystat...@googlegroups.com
Josef,

Thanks for the response.  A bit more about the data, which may explain some things:  These data are analysis of individual crystals from a piece of rock that are basically little radiometric clocks that tell us how long ago the samples cooled below a certain temperature range (such as 150 - 170 C).  Because these are geologic samples from a complicated area (the Andes), some rocks cooled not that long ago, relatively, such as in the past 40 million years, while some cooled maybe 2000 million years ago, and anywhere in between.  Furthermore, some of the samples cooled 400 million years ago and then were reheated to within the critical temperature range, and some of the clocks were reset and others were not (based on variation in crystal size, concentrations of uranium, thorium, chlorine, etc.).  

Measurements of laboratory standards show that the variance is a somewhat linear function of the age, so that the expected standard deviation is 4% of the age, for one type of measurement.  However this is for 'perfect' samples that are used as laboratory standards, and for real samples the standard deviation may be closer to 10%.  It is possible but not known for sure that the variance may be a nonlinear function of age, for example an exponential so that over the past 50 million years it is something like 7% and for samples 250 million years old it is 20%.  But this has not really been studied rigorously as far as I know.

It is also more common for spurious measurements to be older than younger, especially for young samples.

A well-behaved 'young' sample may have these measurements (numbers are millions of years ago): [36.2, 28.1, 30.7, 26.5]

A bit more poorly behaved 'young' samples: [21.6, 12.6, 16.8, 21.2] and [150.4, 28.8, 46.6, 40.2, 46.5]

A well-behaved 'old' sample may have these measurements: [436.5, 471.0, 128.4, 285.9]

Some 'partially reset' samples look like this: [115.5, 15.6, 98.0, 243.7, 13.6]

A few samples have some impossible ages (such as -65,506 meaning the rock cooled 65 billion years in the future)

Linear regression is probably not applicable because the age distributions are functions of latitude, longitude and altitude, and there are large discontinuities in the age patterns because of geologic faults in the study region.  We use 3D finite element modeling with iterative forward modeling in order to understand the age distributions through space and time.  I am just trying to remove the obviously bad measurements before running the finite element models.  Maybe the univariate statistics would be good?  


Thanks again,
Richard

josef...@gmail.com

unread,
Oct 29, 2012, 3:53:01 PM10/29/12
to pystat...@googlegroups.com
out of the box, something like this gives you a continuous measure of
outlyingness
inliers should have weights close to one, outliers have low weights

>>> yy = np.array([436.5, 471.0, 128.4, 285.9])
>>> res = sm.RLM(yy, np.ones(len(yy)), M=sm.robust.norms.TukeyBiweight()).fit(sm.robust.scale.HuberScale())
>>> res.weights
array([ 0.96986092, 0.94647781, 0.88149179, 0.99331732])
>>> yy2 = [150.4, 28.8, 46.6, 40.2, 46.5]
>>> res = sm.RLM(yy2, np.ones(len(yy2)), M=sm.robust.norms.TukeyBiweight()).fit(sm.robust.scale.HuberScale())
>>> res.weights
array([ 0. , 0.81520131, 0.95897607, 0.99921262, 0.9604072 ])

However, this doesn't give you a test. In examples I usually pick some
cutoff depending how strict I want to be

>>> res.weights < 0.9
array([ True, True, False, False, False], dtype=bool)
>>> res.weights < 0.5
array([ True, False, False, False, False], dtype=bool)

To construct a test we need confidence intervals, or a t-test with
robust scale and location. We don't have that out of the box, except
as a recently added method to OLS Results.

Other issue: multiple testing If you need to identify possibly many
outliers, then we need to correct for multiple test.
We have several methods for p-value correction, but again would need
to be connected

for OLS, but doesn't work with result from RLM:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.OLSResults.outlier_test.html#statsmodels.regression.linear_model.OLSResults.outlier_test

similar multiplicity correction:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h3.htm

>>> res_ols = sm.OLS(yy2, np.ones(len(yy2))).fit()
>>> res_ols.params
array([ 62.5])
>>> res_ols.scale
2467.0499999999997
>>> res_ols.outlier_test()
array([[ 11.74089128, 0.00132782, 0.00663912],
[ -0.70999204, 0.52892132, 1. ],
[ -0.31503684, 0.773375 , 1. ],
[ -0.44908625, 0.68382838, 1. ],
[ -0.31708439, 0.77196528, 1. ]])
>>> res_ols = sm.OLS(yy, np.ones(len(yy))).fit()
>>> res_ols.outlier_test()
array([[ 0.71409078, 0.54926322, 1. ],
[ 1.05342159, 0.40263032, 1. ],
[-2.37024861, 0.14124124, 0.56496497],
[-0.27271141, 0.8106523 , 1. ]])

I think using OLS doesn't estimate the scale without outliers, so this
might be off if there are many outliers in the sample.

----------
general:

we don't have currently robust estimation with non-linear models.
But it might be possible to get a version that is linear in parameters
by using a non-linear transformation, and for example a polynomial for
the regressors (age, ....)

I think we could add Grubbs and "Generalized ESD Test for Outliers"
from the NIST pages, using robust location and scale estimates.
But I don't have an example yet how to bring these parts together.

Hope that helps so far,

Josef

josef...@gmail.com

unread,
Oct 29, 2012, 8:42:41 PM10/29/12
to pystat...@googlegroups.com
Given that we are missing the simpler constant mean versions, and
since I found some parts that currently don't work for this case (we
never had this use case directly in mind when writing the robust and
outlier code for linear regression), I opened

https://github.com/statsmodels/statsmodels/issues/550 551 and 552

I tried to find something on the robustness properties of outlier
tests, but didn't find much, except in a journal I don't have access
to
http://ascelibrary.org/doi/abs/10.1061/%28ASCE%290733-9453%281997%29123%3A1%2815%29?journalCode=jsued2
http://ascelibrary.org/doi/abs/10.1061/%28ASCE%290733-9453%281999%29125%3A2%2869%29

(There is a much larger literature on multivariate outlier detection,
with constant mean and variance - no regression. This is currently
more the area of scikit-learn.)

Sorry, if my answers are too long, I just like to get an overview over
where we are with this use case.

Josef

Richard Styron

unread,
Oct 29, 2012, 9:35:25 PM10/29/12
to pystat...@googlegroups.com
Josef, thanks again for the response.  Certainly not too long.  I've got good journal access and if you like I will send you the paper you were cited, and another on Peirce's criterion, which allows for multiple outlier rejection; it is iterative but should always converge (?) and doesn't rely on external inputs for location and scale, although it is designed for normal distributions.


Richard

josef...@gmail.com

unread,
Oct 30, 2012, 12:19:28 AM10/30/12
to pystat...@googlegroups.com
On Mon, Oct 29, 2012 at 9:35 PM, Richard Styron
<richard....@gmail.com> wrote:
> Josef, thanks again for the response. Certainly not too long. I've got
> good journal access and if you like I will send you the paper you were
> cited, and another on Peirce's criterion, which allows for multiple outlier
> rejection; it is iterative but should always converge (?) and doesn't rely
> on external inputs for location and scale, although it is designed for
> normal distributions.

If you sent them directly to me, this would be helpful.

I'm a bit skeptical about Peirce and even more Chauvet. They are
"very" old. And I haven't seen much recent (theoretical) discussion
about them.
I don't know what the underlying assumptions are, and I'm doubtful
that they control the rejection rate in the sequential testing.

http://classes.engineering.wustl.edu/2009/fall/che473/handouts/OutlierRejection.pdf
argues in favor of Peirce but without much detail.

I took their example, and get the same two outliers

>>> yy3 = np.array([101.2, 90.0, 99.0, 102.0, 103.0, 100.2, 89.0, 98.1, 101.5, 102.0])
>>> res = sm.RLM(yy3, np.ones(len(yy3)), M=sm.robust.norms.TukeyBiweight()).fit(sm.robust.scale.HuberScale())
>>> res.weights
array([ 0.99849705, 0. , 0.93362271, 0.97840215, 0.92146192,
0.99062103, 0. , 0.85939873, 0.99369045, 0.97840215])
>>> yy3[res.weights < 0.5]
array([ 90., 89.])
>>> res.scale
2.223903327758403
>>> res.params
array([ 100.91433009])
>>> np.sqrt(res.scale)
1.4912757383389577

something like the following for an outlier test based on robust mean
and variance estimation

>>> from statsmodels.stats.multitest import multipletests

sresid is standardized residual, I'm not sure it's the most appropriate

>>> pval = stats.t.sf(np.abs(res.sresid), res.df_resid)*2
>>> pval
array([ 0.90061412, 0.00083865, 0.41169426, 0.637094 , 0.3728094 ,
0.75538837, 0.00045788, 0.23747397, 0.79820926, 0.637094 ])


controls familywise error rate at alpha = 0.05

>>> multipletests(pval)
(array([False, True, False, False, False, False, True, False, False,
False], dtype=bool), array([ 0.99370536, 0.00752262, 0.96182344,
0.99370536, 0.96182344,
0.99370536, 0.00456936, 0.88570233, 0.99370536,
0.99370536]), 0.0051161968918237433, 0.0050000000000000001)

controls false discovery rate at alpha = 0.05

>>> multipletests(pval, method='fdr_bh')
(array([False, True, False, False, False, False, True, False, False,
False], dtype=bool), array([ 0.90061412, 0.00419327, 0.82338852,
0.88689918, 0.82338852,
0.88689918, 0.00419327, 0.7915799 , 0.88689918,
0.88689918]), 0.0051161968918237433, 0.0050000000000000001)

>>> res_mt = multipletests(pval, method='fdr_bh')
>>> yy3[res_mt[0]] #outlier values
array([ 90., 89.])

The advantage of this approach is that we get a robust estimate of
mean and variance. I think in the iterative methods and sequential
testing are more likely to get stuck when several outliers mask each
other.
IIRC, TukeyBiweight can be sensitive to starting values, and we might
have to try a different starting methods or values. (In the
regression setting we could use an MM-estimator, which just puts a
different starting method in front, but I don't know if that is
relevant in the constant mean case.)

(i just wrote this in a python session, I'm pretty sure it works, but
will take more time to verify it's correct.)

Josef

josef...@gmail.com

unread,
Oct 31, 2012, 12:27:22 PM10/31/12
to pystat...@googlegroups.com
On Tue, Oct 30, 2012 at 12:19 AM, <josef...@gmail.com> wrote:
> On Mon, Oct 29, 2012 at 9:35 PM, Richard Styron
> <richard....@gmail.com> wrote:
>> Josef, thanks again for the response. Certainly not too long. I've got
>> good journal access and if you like I will send you the paper you were
>> cited, and another on Peirce's criterion, which allows for multiple outlier
>> rejection; it is iterative but should always converge (?) and doesn't rely
>> on external inputs for location and scale, although it is designed for
>> normal distributions.
>
> If you sent them directly to me, this would be helpful.

Thanks,
It looks to me that the recent trend for outlier detection goes in the
direction of using robust methods:
http://www.academicjournals.org/sre/PDF/pdf2010/4Apr/Sisman.pdf
http://journals.cambridge.org/action/displayAbstract?fromPage=online&aid=6315860
http://www.springerlink.com/content/g54318112317r380/

Sorry for the delay, I got distracted before hitting send.

I need to go back to other things, but Skipper started again to
improve robust methods.

Josef

josef...@gmail.com

unread,
Oct 31, 2012, 12:33:34 PM10/31/12
to pystat...@googlegroups.com
clarification:
outlier and influence measures for OLS are based on measures that
leave the observation out.
So they are robust to one outlier.
If there are several outliers they can mask each other and many of
those outlier measures can break down for this case.

Josef
Reply all
Reply to author
Forward
0 new messages