Is there something in statsmodels that behaves exactly like MATLAB's robustfit? I need the same functionality, i.e. the same weight functions, and it has to produce the same results.
This comes up in my work translating some MATLAB code into Python. The MATLAB code is widely used, and to get people to convert, the port needs identical outputs.
Thanks.
"The value r in the weight functions is

    r = resid/(tune*s*sqrt(1-h))

where resid is the vector of residuals from the previous iteration, and h is the vector of leverage values from a least-squares fit."
RLM doesn't divide by sqrt(1-h), and I haven't seen that variant before, but it could be added as an option.
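For reference, a minimal numpy sketch of the leverage adjustment described in the quoted docs; the helper name and the QR-based leverage computation are my own illustration, not statsmodels or MATLAB code:

    import numpy as np

    def adjusted_residuals(X, resid, s, tune):
        """r = resid / (tune * s * sqrt(1 - h)), h = OLS leverage values."""
        # Leverages are the diagonal of the hat matrix; for full-rank X
        # they equal the squared row norms of Q from a QR decomposition.
        Q, _ = np.linalg.qr(X)
        h = np.sum(Q**2, axis=1)
        return resid / (tune * s * np.sqrt(1.0 - h))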
scale=MAD
"If there are p columns in X, the smallest p absolute deviations are excluded when computing the median."
We don't exclude any values from the median calculation, and IIRC I dropped the median completely in the PR, assuming that the expected value of the residuals is zero (which it is supposed to be).
I guess this means that you only get agreement to a few decimal places in regular cases, and different behavior in corner or extreme cases.
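A hedged sketch of the difference, taking the doc text quoted above at face value; madsigma_matlab is a made-up name for illustration, not MATLAB's actual helper:

    import numpy as np
    from statsmodels.robust.scale import mad

    def madsigma_matlab(resid, p):
        # Per the quoted docs: drop the p smallest absolute deviations,
        # then rescale by the normal consistency constant 1/0.6745.
        rs = np.sort(np.abs(resid))
        return np.median(rs[p:]) / 0.6745

    resid = np.array([0.1, -0.2, 0.05, 1.5, -0.3, 0.02, 2.0, -0.1])
    print(madsigma_matlab(resid, p=2))  # MATLAB-style, two columns in X
    print(mad(resid, center=0))         # statsmodels MAD, centered at zero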
What's the variance of [1, 1, 1, 1, 1, 1, 2, -1]?
Do we have two outliers and variance = 0, or should we have a strictly positive variance?
It's only really possible to guess from the context, but I'm still not sure what the default assumption should be.
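To make the ambiguity concrete, here is what the two centering conventions give on that example (a quick check of mine, not part of the original discussion):

    import numpy as np
    from statsmodels.robust.scale import mad

    x = np.array([1, 1, 1, 1, 1, 1, 2, -1], dtype=float)
    # Centering at the median: six of eight deviations are exactly zero,
    # so the MAD collapses to a scale of 0.
    print(mad(x))            # 0.0
    # Centering at the assumed zero mean gives a strictly positive scale.
    print(mad(x, center=0))  # ~1.48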
Josef
Thanks.
Hi, I remember having seen MATLAB code implementing (Huber's) robust regression once. I think it corresponded more or less to the paper [1], but I do not know its relationship to MATLAB's robustfit.
One thing I remember well is that the scale update was done with a MAD (divided by a factor similar to sqrt(1 - h), where h reflects the diagonal entries of the covariance matrix of the design... but that part I do not remember exactly).
In all the implementations that I found and all the applied-statistics literature that I browsed, the scale update was not done according to Huber's recommendations. Only statsmodels allows doing it correctly.
Of course, I have not been exhaustive in my research.
[1] Mark Woolrich, "Robust group analysis using outlier inference", NeuroImage, Volume 41, Issue 2, June 2008, Pages 286-301, ISSN 1053-8119, http://dx.doi.org/10.1016/j.neuroimage.2008.02.042 (http://www.sciencedirect.com/science/article/pii/S1053811908001778)
At the time I did my bibliographic work, I was surprised by the fact that people tend to use a version of the algorithm that has never been shown to converge, even though I understand the theoretical version (with the right scale update) is a bit trickier.
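As an illustration of that point (my example, not code from the thread): statsmodels' RLM can update the scale with Huber's own proposal instead of the usual MAD.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=100)  # long-tailed noise
    X = sm.add_constant(x)

    model = sm.RLM(y, X, M=sm.robust.norms.HuberT())
    res_mad = model.fit(scale_est="mad")                         # usual MAD update
    res_hub = model.fit(scale_est=sm.robust.scale.HuberScale())  # Huber's scale update
    print(res_mad.params, res_hub.params)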
<josef...@gmail.com> wrote:
> One problem with robust estimation is that there are too many
> choices/options.
Not a problem for a Bayesian :-D There is just one: integrate over nuisance parameters.
> The other problem is that we need a reference model to
> interpret outliers,
> which is by default the normal distribution.
> (and data like [1, 1, 1, 1, 1, 1, 2.5, -1.5] definitely doesn't look
> "normal", maybe someone rounded too much.)
Which is why scientists should start to use GLMs or GNMs when the least-squares model is inadequate.
In the 1930s the linear least-squares model was the only tractable one, when the available tools were slide rules and mechanical calculators (often with female operators called "computers"). Today, we have desktop computers with (in comparison) infinite computing power. It is really a shame that the applied statistics we teach to most students is stuck in the 1930s, or so it seems. There is a universe of possibilities beyond ANOVA and linear regression today. We don't have to model the world as linear and normally distributed.
Sturla
<josef...@gmail.com> wrote:
> We have GLM, and I was just trying to figure out which bonus features we
> want to add to it. But I've never seen an example for generalized
> nonlinear models; it would be relatively easy to add with a bit of work.
GNM and GAM can be used on many of the same problems, but GNM allows us to
specify the function instead of just using a lowess (or loess?) smoother.
For example, we can have repeated measures over a period of time, and we might know that the shape of the response-against-time curve should be logistic. Many researchers in biology and medicine will routinely use repeated-measures ANOVA to analyse these data sets, i.e. they will look for a group x time interaction, but that model is usually misspecified. GNMs can deal with these problems much more correctly, along the lines of the sketch below.
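A minimal sketch of that idea, fitting a parametric logistic response curve by nonlinear least squares with scipy; statsmodels has no GNM class, so this only illustrates the modeling choice, not an actual GNM fit:

    import numpy as np
    from scipy.optimize import curve_fit

    def logistic(t, top, t50, slope):
        # Three-parameter logistic response-vs-time curve.
        return top / (1.0 + np.exp(-(t - t50) / slope))

    rng = np.random.default_rng(1)
    t = np.linspace(0, 10, 40)
    y = logistic(t, 5.0, 4.0, 1.0) + rng.normal(scale=0.3, size=t.size)

    params, _ = curve_fit(logistic, t, y, p0=[4.0, 5.0, 1.0])
    print(params)  # estimated (top, t50, slope)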
> But they won't make RLM obsolete, and I had started to look at how to add
> (outlier) robust estimation to GLM and the discrete Models.
Since they allow us to specify a probability family, they can better handle
a long-tailed error distribution.
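For instance (my illustration, with t-distributed errors as the long-tailed family; I'm assuming TLinearModel appends the degrees-of-freedom and scale parameters after the coefficients):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.miscmodels.tmodel import TLinearModel

    rng = np.random.default_rng(2)
    x = rng.normal(size=200)
    y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=200)  # long-tailed errors
    X = sm.add_constant(x)

    # Maximum likelihood with t-distributed errors; start at the OLS fit
    # plus rough guesses for the degrees of freedom and scale.
    start = np.r_[sm.OLS(y, X).fit().params, 5.0, 1.0]
    res = TLinearModel(y, X).fit(start_params=start, disp=False)
    print(res.params)  # [const, slope, df, scale]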
Sturla