qq plot added and qq plot changes

142 views
Skip to first unread message

Skipper Seabold

unread,
Aug 25, 2011, 3:45:07 PM8/25/11
to pystat...@googlegroups.com
I merged the qq plot pull request from Chris, and made some changes.

I added a keyword, a, to allow an offset for the plotting points. This
allows us to replicate both R and Stata now (they use .5 and 0 for
qqnorm according to the docs and this looks correct).

I also updated the behavior for line=True. We talked a bit about this
at SciPy, but I still think the idea of standardizing the input data
is orthogonal to whether or not to include a reference line. I'm fine
with making them both optional, but I don't think one affects the
other. Before, you needed to standardize for the 45-degree reference
line to make sense. Now, we just scale the reference line. So the
default reference line now is

m,b = sample_quantile.std(), sample_quantile.mean()
ref_line = theoretical_quantile*m + b
plt.plot(theoretical_quantile, ref_line)

If the data is standardized m=1 and b=0, you get the 45-degree line.
If not, you get a reference line scaled to the input data.

If you don't want either of these as a reference line, then use the
default line=False, and make your own after the call to qqplot before
saving or showing the plot.

This is still in the qqplot branch in the main repository. If these
changes make sense to all, then I'll merge them into master.

https://github.com/statsmodels/statsmodels/commits/qqplot/scikits/statsmodels/graphics/qqplot.py

Skipper

josef...@gmail.com

unread,
Aug 25, 2011, 4:33:27 PM8/25/11
to pystat...@googlegroups.com

Two things that I'm not sure about:

Pierre used a two parameter plotting position correction, for example
http://rss.acs.unt.edu/Rdoc/library/POT/html/qq.uvpot.html
uses p_{j:n} = (j - 0.35) / n
which doesn't fit in the current formula. It might be more common in
the extreme value literature. Although I find the two value
parameterization of Pierre a bit difficult to read.


I'm not quite sure what the new generalized line looks like in
different cases, but I might want to use a fixed 45 degree line,
instead of one that does additional adjustments.

For example for standard t- distribution, scale is different from
standard deviation, but I don't know whether that still applies for
ranks.

The other case is when I want the plot against a specific parametric
distribution (with given fixed parameters, e.g. I want to check mean
or variance shifts) and not against a distribution family, then I
don't want any adjustment to my 45 degree line.

But without playing with it (soon), I don't know what happens in these
cases with the line adjustments.

Josef
(OT: I finally got my cable internet line fixed and don't loose it
anymore every 3 minutes)


>
> Skipper
>

josef...@gmail.com

unread,
Aug 25, 2011, 11:18:14 PM8/25/11
to pystat...@googlegroups.com

I tried it out

https://picasaweb.google.com/106983885143680349926/Joepy#5644990474809138898
and next 3 plots

example: true t with df=3, against t with fit
x = 0.0 + 1. * stats.norm.rvs(size=100)
x = 0.0 + 1. * stats.t.rvs(3, size=100)
#x = stats.zscore(x)

#fig = sm.qqplot(x, stats.norm, fit=True, a=0.5, line=True)#, distargs=(4,))
fig = sm.qqplot(x, stats.t, fit=True, a=0., line=True)

I had also tried against t with df=4 without fit

and in qqplot if line I added 45 degree line in green

plt.plot(theoretical_quantiles, ref_line, 'r-')
plt.plot(theoretical_quantiles, theoretical_quantiles, 'g-')

-----
so your line looks equivalent to zscoring the data, which is a fit in
the normal case, but it's the wrong line for the t-distribution.

I'm still in favor of just the 45 degree line.

I know the plotting positions mainly from David Huard's and Pierre's
code, In what I'm reading I see almost only i/(n+1) to avoid hitting 0
and 1.

Josef


>
>
>>
>> Skipper
>>
>

Skipper Seabold

unread,
Aug 25, 2011, 11:41:06 PM8/25/11
to pystat...@googlegroups.com

The 45 degree line doesn't mean anything when we don't scale the data
unless we also scale the theoretical quantiles. Try the first example
in the docstring with line=True only.

The other option is to make a higher level function that takes
anything as keywords and then make a qt, qnorm, qchi, or whatever, but
I don't see how a 45 degree-line can be general, and the user has the
option to add their own.

> I know the plotting positions mainly from David Huard's and Pierre's
> code, In what I'm reading I see almost only i/(n+1) to avoid hitting 0
> and 1.

I only looked at the (base) R and Stata docs and their references for this.

Skipper

josef...@gmail.com

unread,
Aug 26, 2011, 10:01:36 AM8/26/11
to pystat...@googlegroups.com

the 45 degree lines means: Is it from the specified distribution?

if fit=false, then the Null Hypothesis is whether it comes from the
given distribution
sm.qqplot(res, line=True, fit=False) means: Is res standard normal,
i.e. N(0,1) distributed.
the null hypothesis is not: Is res normally distributed (for any mu, sigma)

if fit=true or the user has already estimated the parameters, then
this is a Null Hypothesis against the class, e.g. is it normally
distributed

data = sm.datasets.longley.load()
data.exog = sm.add_constant(data.exog)
mod_fit = sm.OLS(data.endog, data.exog).fit()
res = mod_fit.resid

plt.figure()
fig = sm.qqplot(res, line=True, fit=True)

plt.figure()
fig = sm.qqplot(stats.zscore(res), line=True)


Using Longley for example of use case for specific (fully specified)
null distribution:
Are the first half and second half of the sample identically
distributed according to joint regression? or are the mean and
variance shifts

#hypothesis: identical mean and identical variance (assuming identical beta)

plt.figure()
sm.qqplot(res[:8]/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 1 against joint, no mean correction')
plt.figure()
sm.qqplot(res[8:]/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 2 against joint, no mean correction')


#hypothesis shift in mean but identical error variance (assuming identical beta)

plt.figure()
sm.qqplot((res[:8] - res[:8].mean())/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 1 against joint, different mean')
plt.figure()
sm.qqplot((res[8:] - res[8:].mean())/res.std(), line=True)
ax = plt.gca()
ax.set_title('subsample 2 against joint, different mean')


How stable is the Longley regression overall? subset regression


#example: Longley with splitting sample in the middle

res1 = sm.OLS(data.endog[:8], data.exog[:8,:]).fit().resid
res2 = sm.OLS(data.endog[8:], data.exog[8:,:]).fit().resid
res1scaled = stats.zmap(res1, res)
res2scaled = stats.zmap(res2, res)
plt.figure()
sm.qqplot(res1scaled, line=True)
ax = plt.gca()
ax.set_title('regression subsample 1')
sm.qqplot(res2scaled, line=True)
ax = plt.gca()
ax.set_title('regression subsample 2')

'''result not very good because longley is a messy dataset
>>> res1.std()
38.479230018144023
>>> res2.std()
113.26130589097247
>>> res.std()
228.64055517147312
>>> res[:8].std()
230.21852406900729
>>> res[8:].std()
220.45605765802966
'''

The same difference in null hypothesis applies to all GOF tests: for example

kstest: specific distribution (all parameters fixed). The user can
estimate the parameters first but then the p-values are wrong.

chisquare test: specific distribution (all parameters fixed). If the
user estimates the parameters first, then the asymptotic distribution
with ddof is correct for some estimation procedures.

anderson: distribution family (parameters are estimated). Is only
available for some distributions. Parameter estimation cannot be
turned off in scipy.stats.anderson


> The other option is to make a higher level function that takes
> anything as keywords and then make a qt, qnorm, qchi, or whatever, but
> I don't see how a 45 degree-line can be general, and the user has the
> option to add their own.

Because residual check for normal distribution will be for us the most
common case, I thought the defaults should be: line=true, fit=true,
which corresponds to your original version of qqplot.

So we could make a special case just with
def qqnormal(data):
return sm.qqplot(stats.zscore(data), line=True)

(as an aside: I didn't find a quick way to change the color of the
scatter, so I can display 2, or k, samples on the same plot.)

Josef

josef...@gmail.com

unread,
Aug 26, 2011, 10:28:15 AM8/26/11
to pystat...@googlegroups.com

a bit nicer: subset regression example

#example: Longley with splitting sample in the middle
res1 = sm.OLS(data.endog[:8], data.exog[:8,:]).fit().resid
res2 = sm.OLS(data.endog[8:], data.exog[8:,:]).fit().resid

stdavg = 0.5 * (res1.std() + res2.std())
res1scaled = res1 / stdavg
res2scaled = res2 / stdavg

plt.figure()
sm.qqplot(res1scaled, line=True)


sm.qqplot(res2scaled, line=True)
ax = plt.gca()

ax.lines[0].set_c('r')
ax.lines[4].set_c('b')
ax.set_title('regression subsamples')

Christopher Jordan-Squire

unread,
Aug 26, 2011, 10:45:42 AM8/26/11
to pystat...@googlegroups.com

I prefer the defaults line=False, fit=False. That's the default for R, and that's what more R users will be comfortable with. It's also easier for interactive use. And qqplot is very much a tool for interactive data analysis. (Moreso than programmatic use, I would think). The user can, of course, define their own wrapper which resets the defaults. So perhaps leaving the defaults so they are simplest for interactive is best.

(The equivalent in R I'm thinking of is qqnorm: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/qqnorm.html
In R there's a helper function, qqline, that adds a line through the first and third quantiles in lieu of a fitted line. This can make it to observe/think about tail behavior. Having something similar would be nice, but such a line would be different from a simple 45 degree line.)
 
(as an aside: I didn't find a quick way to change the color of the
scatter, so I can display 2, or k, samples on the same plot.)

So that should be added as a parameter?

-Chris JS

 

Skipper Seabold

unread,
Aug 26, 2011, 10:49:56 AM8/26/11
to pystat...@googlegroups.com

I realize that. My point is that we know something on the scale of
-400 to 400 didn't come from the standard normal. This is why I was
standardizing all the data when it came in in my version. Chris said
he didn't want to do that, which is fine. But if we don't standardize
the data and we ask for a reference line, I assume the user doesn't
want a big flat line in the middle of his graph, which was the
default, 45 degree line. See attached. That's what I mean when I say
it doesn't mean anything.


>> The other option is to make a higher level function that takes
>> anything as keywords and then make a qt, qnorm, qchi, or whatever, but
>> I don't see how a 45 degree-line can be general, and the user has the
>> option to add their own.
>
> Because residual check for normal distribution will be for us the most
> common case, I thought the defaults should be: line=true, fit=true,
> which corresponds to your original version of qqplot.
>

Chris argued that he didn't want to standardize, and I can see this
now, but the default line then should not be something that's
obviously not what the user wants. There should be a way (as in Stata
and in R) to not standardize the data and get a reference line.

Skipper

45degreeline.png

josef...@gmail.com

unread,
Aug 26, 2011, 11:00:28 AM8/26/11
to pystat...@googlegroups.com

It means, the sample didn't come from N(0,1) that's what you asked for.

>
>
>>> The other option is to make a higher level function that takes
>>> anything as keywords and then make a qt, qnorm, qchi, or whatever, but
>>> I don't see how a 45 degree-line can be general, and the user has the
>>> option to add their own.
>>
>> Because residual check for normal distribution will be for us the most
>> common case, I thought the defaults should be: line=true, fit=true,
>> which corresponds to your original version of qqplot.
>>
>
> Chris argued that he didn't want to standardize, and I can see this
> now, but the default line then should not be something that's
> obviously not what the user wants.

If fit=False, then this *IS* what the user *wants* and asks for!
(with: I in users)
If the user wants rescaling, then the correct option is fit=True.

Josef

Skipper Seabold

unread,
Aug 26, 2011, 11:03:54 AM8/26/11
to pystat...@googlegroups.com

No other software behaves like this for good reason. The other thing I
was thinking yesterday is that standardizing and fit really shouldn't
be the same option. You can still transform the theoretical quantiles
by passing in loc and scale without having to change the data.

Skipper

josef...@gmail.com

unread,
Aug 26, 2011, 11:13:00 AM8/26/11
to pystat...@googlegroups.com

A rough count:
> ??qqplot in R shows about 14 different versions of qqplot in various packages (or 14 is the count of packages that each have their own qqplots.)

qqnorm seems specialized to normal case from the help


I think now, the best will be to load up qqplot with options that
cover all cases that we can think of. I don't care specifically about
the defaults.

And then we write some additional functions with special parameter
settings, with automatic rescaling in qqnorm or qqnormal as Skipper
wants.

>
>>
>> (as an aside: I didn't find a quick way to change the color of the
>> scatter, so I can display 2, or k, samples on the same plot.)
>>
> So that should be added as a parameter?


I figured out how to access the underlying line color, so maybe we can
just make a separate qqplot for k-sample comparisons.

I don't know if we need to add a plotkwds={} or plotkwds=None for the
`...` in R's qqnorm

Josef

josef...@gmail.com

unread,
Aug 26, 2011, 12:03:41 PM8/26/11
to pystat...@googlegroups.com

You have the behavior that you want with qqnorm.

According to this http://www.stata.com/help.cgi?qqplot STATA doesn't
have a generic qqplot, only two sample or against normal or chi2

Maybe I read to much GOF literature, but even for graphical helper
plots, I find it "dirty" if the graph doesn't have a clear null
hypothesis.

I wanted to check what evir:qplot does:
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/evir/html/qplot.html
(I have version 1.7-1 on my computer but the help is the same.)

It doesn't specify at all what the line is. A brief look at the source
shows that it is a regression line.
So what's the null hypothesis of the line if my data is a squared
normal and the reference distribution is exponential (default)?

>The other thing I
> was thinking yesterday is that standardizing and fit really shouldn't
> be the same option. You can still transform the theoretical quantiles
> by passing in loc and scale without having to change the data.

Standardizing and fit and passing in loc=data.mean(), scale=data.std()
are the same for the normal distribution, except for the labeling of
the axes.
For most other distributions standardizing doesn't make any sense.

We could also make
line : string or None
specifies which type of line to plot, options: '45',
'standardized', 'regress', 'quartile_diff'
which can also be shortened to '45', 's', 'r', 'q'. If fit is
true, then the default is '45', otherwise it is None, no line.

Josef

>
> Skipper
>

josef...@gmail.com

unread,
Aug 26, 2011, 10:35:02 PM8/26/11
to pystat...@googlegroups.com

while doing some semi-random readings

plotting a linear regression line seems to be the standard in extreme
value theory
(but it's for non-linear increasing transformation of the random
variable that makes it linear with respect to some specific reference
distribution, which does an implicit partial fit, I think)

QRMlib/html/QQPlot.html

contains: "See Beirlant et al, "Statistics of Extremes", Chapter 1.2.1
for descriptions of various QQ plots."

media.wiley.com/product_data/excerpt/74/04719764/0471976474.pdf

Josef
>
> Josef
>
>>
>> Skipper
>>
>

Skipper Seabold

unread,
Aug 29, 2011, 11:40:34 AM8/29/11
to pystat...@googlegroups.com

I don't follow. Why does centering the data and dividing by the
standard deviation not make sense for other distributions? Honest
question. I was under the impression it's just to make things have a
comparable scale.

> We could also make
> line : string or None
>    specifies which type of line to plot, options: '45',
> 'standardized', 'regress', 'quartile_diff'
>    which can also be shortened to '45', 's', 'r', 'q'. If fit is
> true, then the default is '45', otherwise  it is None, no line.
>

Sounds reasonable. I'll push this change then merge to master. Others
can have at it, if they're not happy.

> Josef
>
>>
>> Skipper
>>
>

josef...@gmail.com

unread,
Aug 29, 2011, 12:05:31 PM8/29/11
to pystat...@googlegroups.com

For most distributions dividing by the standard deviation does not
produce a "standard" distribution.

example t-distribution
the standard deviation depends both on the degrees of freedom and on
the "scale" (in the rescale the variable, linear transformation sense
as in scipy.stats)

If we divide by the "scale", then we get a *standard* t-distributions,
which does not have std=1. If we devide by std, then the scale
compared to the *standard* t-distributions is not one.

Dividing by std to standardize a distribution only works if the
variance of the *standard* distribution is 1.

as an aside
*Standard* t-distribution was developed for tests (as far as I know).
For econometrics/statistics with t-distributed residuals this
standard, variance not equal 1, is not so useful. Many (or most)
econometricians work with "standardized" t-distribution. It's a
reparameterization, so that the standard deviation of the
"standardized" t-distribution is 1. (available somewhere in the
sandbox, as comparison but not used)
In that case rescaling by std standardizes to the standardized t-distribution.

Josef

Skipper Seabold

unread,
Aug 29, 2011, 2:09:02 PM8/29/11
to pystat...@googlegroups.com
If y'all want to sanity check me and make sure it's general enough. I
don't think that the linear transformation from the extreme values
paper can be accommodated. We might need a plotting_pos (or some
other) argument that can be user or helper function defined, but I'll
leave that to you. I will leave this in a branch for now to give y'all
a chance to play with it. I don't plan on making anymore changes on
this.

https://github.com/statsmodels/statsmodels/commit/c305d48882da97de3271bcdc2f4cdeb87780c8ec

Skipper

Reply all
Reply to author
Forward
0 new messages