I added a keyword, a, to allow an offset for the plotting positions. This
allows us to replicate both R and Stata now (they use 0.5 and 0,
respectively, for qqnorm according to the docs, and this looks correct).
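For reference, a minimal sketch of the one-parameter plotting-position family that such an `a` keyword implies (the formula and helper name here are my assumptions, not the statsmodels code):

```python
import numpy as np

def plotting_positions(n, a=0.5):
    """Hypothetical helper: p_i = (i - a) / (n + 1 - 2*a), i = 1..n."""
    i = np.arange(1.0, n + 1)
    return (i - a) / (n + 1 - 2 * a)

# a = 0.5 gives (i - 0.5) / n, R's qqnorm default for larger samples
# a = 0   gives i / (n + 1), which matches Stata
```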
I also updated the behavior for line=True. We talked a bit about this
at SciPy, but I still think the idea of standardizing the input data
is orthogonal to whether or not to include a reference line. I'm fine
with making them both optional, but I don't think one affects the
other. Before, you needed to standardize for the 45-degree reference
line to make sense. Now, we just scale the reference line. So the
default reference line now is:
m, b = sample_quantile.std(), sample_quantile.mean()
ref_line = theoretical_quantile * m + b
plt.plot(theoretical_quantile, ref_line)
If the data are standardized, then m=1 and b=0, and you get the
45-degree line. If not, you get a reference line scaled to the input
data. If you don't want either of these as a reference line, then use
the default line=False and add your own after the call to qqplot,
before saving or showing the plot.
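A self-contained sketch of that scaled reference line (the sample data and variable names are illustrative, not the qqplot internals):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = 5 + 2 * rng.standard_normal(200)      # unstandardized sample

n = len(x)
sample_quantile = np.sort(x)
theoretical_quantile = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

# default reference line, scaled to the input data
m, b = sample_quantile.std(), sample_quantile.mean()
ref_line = theoretical_quantile * m + b
# for zscored data m ~= 1 and b ~= 0, so this reduces to the 45-degree line
```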
This is still in the qqplot branch in the main repository. If these
changes make sense to all, then I'll merge them into master.
https://github.com/statsmodels/statsmodels/commits/qqplot/scikits/statsmodels/graphics/qqplot.py
Skipper
Two things that I'm not sure about:
Pierre used a two-parameter plotting-position correction; for example
http://rss.acs.unt.edu/Rdoc/library/POT/html/qq.uvpot.html
uses p_{j:n} = (j - 0.35) / n,
which doesn't fit in the current formula. It might be more common in
the extreme value literature, although I find Pierre's two-parameter
version a bit difficult to read.
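For what it's worth, scipy already carries a two-parameter form (David Huard's, I believe) in `scipy.stats.mstats.plotting_positions`, which covers the (j - 0.35)/n case:

```python
import numpy as np
from scipy.stats import mstats

# mstats.plotting_positions computes p_i = (i - alpha) / (n + 1 - alpha - beta);
# alpha=0.35, beta=0.65 recovers the POT formula (j - 0.35) / n
data = np.arange(1.0, 11.0)
pp = mstats.plotting_positions(data, alpha=0.35, beta=0.65)
```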
I'm not quite sure what the new generalized line looks like in
different cases, but I might want to use a fixed 45-degree line
instead of one that does additional adjustments.
For example, for the standard t-distribution, the scale is different
from the standard deviation, but I don't know whether that still
applies for ranks.
The other case is when I want to plot against a specific parametric
distribution (with given fixed parameters, e.g. when I want to check
mean or variance shifts) and not against a distribution family; then I
don't want any adjustment to my 45-degree line.
But without playing with it (soon), I don't know what happens in these
cases with the line adjustments.
Josef
(OT: I finally got my cable internet line fixed and don't lose it
anymore every 3 minutes)
I tried it out
https://picasaweb.google.com/106983885143680349926/Joepy#5644990474809138898
and the next 3 plots
example: true t with df=3, against t with fit
#x = 0.0 + 1. * stats.norm.rvs(size=100)
x = 0.0 + 1. * stats.t.rvs(3, size=100)
#x = stats.zscore(x)
#fig = sm.qqplot(x, stats.norm, fit=True, a=0.5, line=True)  #, distargs=(4,))
fig = sm.qqplot(x, stats.t, fit=True, a=0., line=True)
I had also tried against t with df=4 without fit
and in qqplot, when line is set, I added the 45-degree line in green:
plt.plot(theoretical_quantiles, ref_line, 'r-')
plt.plot(theoretical_quantiles, theoretical_quantiles, 'g-')
-----
so your line looks equivalent to zscoring the data, which is a fit in
the normal case, but it's the wrong line for the t-distribution.
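The equivalence is easy to check numerically (a sketch with illustrative names, not the qqplot internals): plotting the raw data against the scaled line is the same, up to dividing by the std, as plotting the zscored data against the 45-degree line.

```python
import numpy as np
from scipy import stats

x = stats.t.rvs(3, size=200, random_state=np.random.default_rng(2))
n = len(x)
theor = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)

ref = theor * x.std() + x.mean()       # scaled line on the raw data
z = (np.sort(x) - x.mean()) / x.std()  # zscored data

# residuals around the scaled line equal residuals around the 45-degree
# line after rescaling by the std
same = np.allclose((np.sort(x) - ref) / x.std(), z - theor)
```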
I'm still in favor of just the 45 degree line.
I know the plotting positions mainly from David Huard's and Pierre's
code. In what I'm reading, I see almost only i/(n+1), to avoid hitting
0 and 1.
Josef
The 45 degree line doesn't mean anything when we don't scale the data
unless we also scale the theoretical quantiles. Try the first example
in the docstring with line=True only.
The other option is to make a higher level function that takes
anything as keywords and then make a qt, qnorm, qchi, or whatever, but
I don't see how a 45 degree-line can be general, and the user has the
option to add their own.
> I know the plotting positions mainly from David Huard's and Pierre's
> code, In what I'm reading I see almost only i/(n+1) to avoid hitting 0
> and 1.
I only looked at the (base) R and Stata docs and their references for this.
Skipper
The 45-degree line means: is it from the specified distribution?
If fit=False, then the null hypothesis is that the sample comes from
the given distribution.
sm.qqplot(res, line=True, fit=False) asks: is res standard normal,
i.e. N(0,1) distributed?
The null hypothesis is not: is res normally distributed (for any mu, sigma)?
If fit=True, or the user has already estimated the parameters, then
this is a null hypothesis against the distribution family, e.g. is it
normally distributed?
data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid
plt.figure()
fig = sm.qqplot(res, line=True, fit=True)
plt.figure()
fig = sm.qqplot(stats.zscore(res), line=True)
Using Longley as an example of a use case for a specific (fully
specified) null distribution:
Are the first half and second half of the sample identically
distributed according to the joint regression, or are there mean and
variance shifts?
#hypothesis: identical mean and identical variance (assuming identical beta)
plt.figure()
sm.qqplot(res[:8]/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 1 against joint, no mean correction')
plt.figure()
sm.qqplot(res[8:]/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 2 against joint, no mean correction')
#hypothesis shift in mean but identical error variance (assuming identical beta)
plt.figure()
sm.qqplot((res[:8] - res[:8].mean())/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 1 against joint, different mean')
plt.figure()
sm.qqplot((res[8:] - res[8:].mean())/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 2 against joint, different mean')
How stable is the Longley regression overall? subset regression
#example: Longley with splitting sample in the middle
res1 = sm.OLS(data.endog[:8], data.exog[:8,:]).fit().resid
res2 = sm.OLS(data.endog[8:], data.exog[8:,:]).fit().resid
res1scaled = stats.zmap(res1, res)
res2scaled = stats.zmap(res2, res)
plt.figure()
sm.qqplot(res1scaled, line=True)
ax = plt.gca()
ax.set_title('regression subsample 1')
plt.figure()
sm.qqplot(res2scaled, line=True)
ax = plt.gca()
ax.set_title('regression subsample 2')
'''results not very good because Longley is a messy dataset
>>> res1.std()
38.479230018144023
>>> res2.std()
113.26130589097247
>>> res.std()
228.64055517147312
>>> res[:8].std()
230.21852406900729
>>> res[8:].std()
220.45605765802966
'''
The same difference in null hypothesis applies to all GOF tests, for example:
kstest: specific distribution (all parameters fixed). The user can
estimate the parameters first, but then the p-values are wrong.
chisquare test: specific distribution (all parameters fixed). If the
user estimates the parameters first, then the asymptotic distribution
with ddof is correct for some estimation procedures.
anderson: distribution family (parameters are estimated). It is only
available for some distributions, and parameter estimation cannot be
turned off in scipy.stats.anderson.
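To illustrate the kstest point with a quick sketch: against the fully specified N(0,1) a shifted sample is clearly rejected, while estimating loc/scale first shrinks the statistic, so the tabulated p-value becomes too optimistic.

```python
import numpy as np
from scipy import stats

x = 3 + 2 * np.random.default_rng(1).standard_normal(500)

# fully specified null N(0, 1): clearly rejected
stat_raw, p_raw = stats.kstest(x, 'norm')

# estimating loc/scale first makes the fit look much better,
# but the tabulated p-value is no longer valid for this procedure
stat_fit, p_fit = stats.kstest((x - x.mean()) / x.std(), 'norm')
```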
> The other option is to make a higher level function that takes
> anything as keywords and then make a qt, qnorm, qchi, or whatever, but
> I don't see how a 45 degree-line can be general, and the user has the
> option to add their own.
Because checking residuals for normality will be our most common case,
I thought the defaults should be line=True, fit=True, which
corresponds to your original version of qqplot.
So we could make a special case just with
def qqnormal(data):
    return sm.qqplot(stats.zscore(data), line=True)
(as an aside: I didn't find a quick way to change the color of the
scatter, so I can display 2, or k, samples on the same plot.)
Josef
a bit nicer: subset regression example
#example: Longley with splitting sample in the middle
res1 = sm.OLS(data.endog[:8], data.exog[:8,:]).fit().resid
res2 = sm.OLS(data.endog[8:], data.exog[8:,:]).fit().resid
stdavg = 0.5 * (res1.std() + res2.std())
res1scaled = res1 / stdavg
res2scaled = res2 / stdavg
plt.figure()
sm.qqplot(res1scaled, line=True)
sm.qqplot(res2scaled, line=True)
ax = plt.gca()
ax.lines[0].set_c('r')
ax.lines[4].set_c('b')
ax.set_title('regression subsamples')
I realize that. My point is that we know something on the scale of
-400 to 400 didn't come from the standard normal. This is why I was
standardizing all the data when it came in, in my version. Chris said
he didn't want to do that, which is fine. But if we don't standardize
the data and we ask for a reference line, I assume the user doesn't
want a big flat line in the middle of the graph, which is what the
default 45-degree line gave. See attached. That's what I mean when I
say it doesn't mean anything.
>> The other option is to make a higher level function that takes
>> anything as keywords and then make a qt, qnorm, qchi, or whatever, but
>> I don't see how a 45 degree-line can be general, and the user has the
>> option to add their own.
>
> Because residual check for normal distribution will be for us the most
> common case, I thought the defaults should be: line=true, fit=true,
> which corresponds to your original version of qqplot.
>
Chris argued that he didn't want to standardize, and I can see this
now, but the default line then should not be something that's
obviously not what the user wants. There should be a way (as in Stata
and in R) to not standardize the data and still get a reference line.
Skipper
It means the sample didn't come from N(0,1); that's what you asked for.
>>> The other option is to make a higher level function that takes
>>> anything as keywords and then make a qt, qnorm, qchi, or whatever, but
>>> I don't see how a 45 degree-line can be general, and the user has the
>>> option to add their own.
>>
>> Because residual check for normal distribution will be for us the most
>> common case, I thought the defaults should be: line=true, fit=true,
>> which corresponds to your original version of qqplot.
>>
>
> Chris argued that he didn't want to standardize, and I can see this
> now, but the default line then should not be something that's
> obviously not what the user wants.
If fit=False, then this *IS* what the user *wants* and asks for!
(with: I in users)
If the user wants rescaling, then the correct option is fit=True.
Josef
No other software behaves like this for good reason. The other thing I
was thinking yesterday is that standardizing and fit really shouldn't
be the same option. You can still transform the theoretical quantiles
by passing in loc and scale without having to change the data.
Skipper
A rough count: ??qqplot in R shows about 14 different versions of
qqplot in various packages (or 14 is the count of packages that each
have their own qqplot). qqnorm seems specialized to the normal case,
judging from the help.
I think now the best will be to load up qqplot with options that
cover all the cases we can think of. I don't care specifically about
the defaults.
And then we write some additional functions with special parameter
settings, e.g. with automatic rescaling in qqnorm or qqnormal, as
Skipper wants.
>
>>
>> (as an aside: I didn't find a quick way to change the color of the
>> scatter, so I can display 2, or k, samples on the same plot.)
>>
> So that should be added as a parameter?
I figured out how to access the underlying line color, so maybe we can
just make a separate qqplot for k-sample comparisons.
I don't know if we need to add a plotkwds={} or plotkwds=None for the
`...` in R's qqnorm
Josef
You have the behavior that you want with qqnorm.
According to http://www.stata.com/help.cgi?qqplot, Stata doesn't
have a generic qqplot, only two-sample or against normal or chi2.
Maybe I read too much GOF literature, but even for graphical helper
plots, I find it "dirty" if the graph doesn't have a clear null
hypothesis.
I wanted to check what evir:qplot does:
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/evir/html/qplot.html
(I have version 1.7-1 on my computer but the help is the same.)
It doesn't specify at all what the line is. A brief look at the source
shows that it is a regression line.
So what's the null hypothesis of the line if my data is a squared
normal and the reference distribution is exponential (default)?
>The other thing I
> was thinking yesterday is that standardizing and fit really shouldn't
> be the same option. You can still transform the theoretical quantiles
> by passing in loc and scale without having to change the data.
Standardizing and fit and passing in loc=data.mean(), scale=data.std()
are the same for the normal distribution, except for the labeling of
the axes.
For most other distributions standardizing doesn't make any sense.
We could also make
line : str or None
    Specifies which type of line to plot. Options: '45',
    'standardized', 'regress', 'quartile_diff', which can also be
    shortened to '45', 's', 'r', 'q'. If fit is true, the default
    is '45'; otherwise it is None (no line).
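A rough sketch of how that dispatch could look (my own construction, not statsmodels code; for 'q' I put the line through the first and third quartiles the way R's qqline does, which may differ from what 'quartile_diff' is meant to be):

```python
import numpy as np
from scipy import stats

def reference_line(theor_q, sample_q, line='45'):
    """Hypothetical dispatch for the proposed `line` options."""
    if line == '45':
        return theor_q
    if line in ('s', 'standardized'):
        return theor_q * sample_q.std() + sample_q.mean()
    if line in ('r', 'regress'):
        slope, intercept = np.polyfit(theor_q, sample_q, 1)
        return theor_q * slope + intercept
    if line in ('q', 'quartile_diff'):
        # line through the quartiles, as in R's qqline (normal reference)
        q25, q75 = np.percentile(sample_q, [25, 75])
        t25, t75 = stats.norm.ppf([0.25, 0.75])
        slope = (q75 - q25) / (t75 - t25)
        return theor_q * slope + (q25 - slope * t25)
    raise ValueError("unknown line option %r" % line)
```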
Josef
While doing some semi-random reading:
plotting a linear regression line seems to be the standard in extreme
value theory
(but it's for a non-linear increasing transformation of the random
variable that makes it linear with respect to some specific reference
distribution, which does an implicit partial fit, I think)
QRMlib/html/QQPlot.html
contains: "See Beirlant et al, "Statistics of Extremes", Chapter 1.2.1
for descriptions of various QQ plots."
media.wiley.com/product_data/excerpt/74/04719764/0471976474.pdf
Josef
I don't follow. Why does centering the data and dividing by the
standard deviation not make sense for other distributions? Honest
question. I was under the impression it's just to make things have a
comparable scale.
> We could also make
> line : string or None
> specifies which type of line to plot, options: '45',
> 'standardized', 'regress', 'quartile_diff'
> which can also be shortened to '45', 's', 'r', 'q'. If fit is
> true, then the default is '45', otherwise it is None, no line.
>
Sounds reasonable. I'll push this change then merge to master. Others
can have at it, if they're not happy.
For most distributions, dividing by the standard deviation does not
produce a "standard" distribution.
Example: the t-distribution.
The standard deviation depends both on the degrees of freedom and on
the "scale" (in the rescale-the-variable, linear-transformation sense,
as in scipy.stats).
If we divide by the "scale", then we get a *standard* t-distribution,
which does not have std=1. If we divide by the std, then the scale
compared to the *standard* t-distribution is not one.
Dividing by the std to standardize a distribution only works if the
variance of the *standard* distribution is 1.
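Concretely, with scipy:

```python
import numpy as np
from scipy import stats

# the *standard* t (loc=0, scale=1) does not have std 1:
df = 3
std_of_standard_t = stats.t(df).std()   # sqrt(df / (df - 2)), not 1

# so dividing a t sample by its std gives a scale != 1 relative to the
# standard t, and dividing by the scale parameter leaves std != 1
```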
As an aside:
the *standard* t-distribution was developed for tests (as far as I
know). For econometrics/statistics with t-distributed residuals, this
standard, with variance not equal to 1, is not so useful. Many (or
most) econometricians work with the "standardized" t-distribution.
It's a reparameterization so that the standard deviation of the
"standardized" t-distribution is 1. (It's available somewhere in the
sandbox, as a comparison, but not used.)
In that case, rescaling by the std standardizes to the standardized
t-distribution.
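A minimal sketch of that reparameterization (my own construction, not the sandbox code): pick the scipy scale parameter so the resulting distribution has exactly the requested standard deviation.

```python
import numpy as np
from scipy import stats

def standardized_t(df, loc=0.0, sigma=1.0):
    """t distribution reparameterized so its std is exactly sigma (df > 2)."""
    # scale * sqrt(df / (df - 2)) == sigma  =>  scale = sigma * sqrt((df-2)/df)
    return stats.t(df, loc=loc, scale=sigma * np.sqrt((df - 2.0) / df))

rv = standardized_t(5, sigma=2.0)
```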
Josef
https://github.com/statsmodels/statsmodels/commit/c305d48882da97de3271bcdc2f4cdeb87780c8ec
Skipper