[pystatsmodels] Tukey Post Hoc Test-three samples

425 views
Skip to first unread message

H Raja

unread,
May 24, 2010, 2:21:41 PM5/24/10
to pystatsmodels
Hi, I'm new to this board. I just started using python 2 months ago
for some biostatistical work. I tried using R, but R is a real pain to
learn, especially for beginners like me.
Here's a simple Tukey post hoc test for three samples after anova. I
do hope it works. The qcrittable is simply a qcrit table I got from a
website. I know this is pretty bad code, but someone smarter should be
able to improve it :)







import xlrd
import xlwt
import scipy.stats
import numpy
import math


def Tukeythreegene(first,second,third): #Performing the Tukey HSD post-
hoc test for three genes
qwb = xlrd.open_workbook('F:/Lab/bioinformatics/qcrittable.xls')
#opening the workbook containing the q crit table
qwb.sheet_names()
qcrittable = qwb.sheet_by_name(u'Sheet1')

firstmean = numpy.mean(first) #means of the three arrays
secondmean = numpy.mean(second)
thirdmean = numpy.mean(third)

firststd = numpy.std(first) #standard deviations of the three
arrays
secondstd = numpy.std(second)
thirdstd = numpy.std(third)

firsts2 = math.pow(firststd,2) #standard deviation squared of the
three arrays
seconds2 = math.pow(secondstd,2)
thirds2 = math.pow(thirdstd,2)

mserrornum = firsts2*2+seconds2*2+thirds2*2 #numerator for mean
square error
mserrorden = (len(first)+len(second)+len(third))-3 #denominator
for mean square error
mserror = mserrornum/mserrorden #mean square error

standarderror = math.sqrt(mserror/len(first)) #standard error,
which is square root of mserror and the number of samples in a group

dftotal = len(first)+len(second)+len(third)-1 #various degrees of
freedom
dfgroups = 2
dferror = dftotal-dfgroups

qcrit = qcrittable.cell(dftotal, 3).value #getting the q critical
value, for degrees of freedom total and 3 groups

qtest3to1 = (math.fabs(thirdmean-firstmean))/standarderror
#calculating q test statistic values
qtest3to2 = (math.fabs(thirdmean-secondmean))/standarderror
qtest2to1 = (math.fabs(secondmean-firstmean))/standarderror

conclusion = []

## print qcrit
## print qtest3to1
## print qtest3to2
## print qtest2to1

if(qtest3to1>qcrit): #testing all q test statistic values to q
critical values
conclusion.append('3to1null')
else:
conclusion.append('3to1alt')
if(qtest3to2>qcrit):
conclusion.append('3to2null')
else:
conclusion.append('3to2alt')
if(qtest2to1>qcrit):
conclusion.append('2to1null')
else:
conclusion.append('2to1alt')

return conclusion

Vincent Davis

unread,
May 25, 2010, 12:17:53 AM5/25/10
to pystat...@googlegroups.com
Well you didn't seem to have a question but here are a few next step improvements to you code, you'll need to finish for it to work but you should get the idea. You might also that a look at biopython. I have done some stuff with micro array data and have been meaning to duplicate some of the functions available in R/bioconductor, I don't care much for it either but the bioconductor packages are good.


def Tukeythreegene(genes): #Performing the Tukey HSD post-hoc test for three genes
   """gend is a list, ie [first, second, third]"""
   qwb = xlrd.open_workbook('F:/Lab/bioinformatics/qcrittable.xls')
#opening the workbook containing the q crit table
   qwb.sheet_names()
   qcrittable = qwb.sheet_by_name(u'Sheet1')
   
   means = []
   stds = []
   for gene in genes:
      means.append(numpy.mean(gene))
      std.append(numpy.std(gene))
      
   #firstmean = numpy.mean(first) #means of the three arrays
   #secondmean = numpy.mean(second)
   #thirdmean = numpy.mean(third)

   #firststd = numpy.std(first) #standard deviations of the three arrays
   #secondstd = numpy.std(second)
   #thirdstd = numpy.std(third)
   
   stds2 = []
   for std in stds:
      stds2.append(math.pow(std,2))
   
   
   #firsts2 = math.pow(firststd,2) #standard deviation squared of the three arrays
   #seconds2 = math.pow(secondstd,2)
   #thirds2 = math.pow(thirdstd,2)

   #mserrornum = firsts2*2+seconds2*2+thirds2*2 #numerator for mean square error
   mserrornum = sum(stds2)*2
   mserrorden = (len(genes[0])+len(genes[1])+len(genes[2]))-3 #denominator for mean square error
   mserror = mserrornum/mserrorden #mean square error

josef...@gmail.com

unread,
May 25, 2010, 12:58:25 AM5/25/10
to pystat...@googlegroups.com
I had to do some search to find out what the test is doing. A
docstring with a brief description would be very useful.

p-value correction for multiple tests is a very important topic, but
neither scipy.stats nor statsmodels has anything in this area yet.

I have no good overview over this, and only looked a bit at Bonferroni
before. (I have a better idea about similar problems in other areas of
econometrics. In a related example for anova or test for equality of
coefficients, I only report uncorrected p-values.)

in
http://www.itl.nist.gov/div898/handbook/prc/section4/prc47.htm

I found this

Tests on Means after Experimentation
Procedures for performing multiple comparisons If the decision on what
comparisons to make is withheld until after the data are examined, the
following procedures can be used:

Tukey's Method to test all possible pairwise differences of means to
determine if at least one difference is significantly different from
0.
Scheffé's Method to test all possible contrasts at the same time, to
see if at least one is significantly different from 0.
Bonferroni Method to test, or put simultaneous confidence intervals
around, a pre-selected group of contrasts

I think it would be good if either scipy.stats or statsmodels would
expand in this area.

just two comments as addition to what Vincent said

Do you know the background for the critical values? Some web pages I
found, say that the critical values are standard textbook material,
NIST dataplot is supposed to have it but I didn't find any
description.

For several samples, there are two competing ways of specifying the
data, one array per sample/group, or all in a single array with a
second array specifying from which sample or group the observation
comes from. Some functions in scipy.stats use the former, I usually
prefer the latter, because it's easier to work with an arbitrary
number of groups and because np.bincount (and some other functions I
wrote) can calculate group means and group variances easily.
Although in this case with just a few groups it wouln't make much of a
difference.

I assume in your function (first,second,third) already have to be
pre-sorted, I haven't figured out completely how the sequential
testing rule works.

Josef

H Raja

unread,
May 25, 2010, 12:38:53 PM5/25/10
to pystatsmodels
Vincent-

Actually I do have one question, which is in regards to the best way
to do ANCOVA using Python. Right now I'm using R for it, but being
able to do it with Python would enable me to integrate it into my
program much better. I'm currently doing data analysis on microarrays
as well. I will check out bioconductor, but so far I have been able to
do everything I needed with just python. It is just so much easier to
pull pieces of data from multiple files and piece them together in
Python.

I will incorporate the suggestions you made. I was racking by brains
on how I can apply this function to ANOVAs with more than three sets
of values. I'll come back if I have any problems on this.




Josef-

I should have explained this a bit more, sorry. I use the program
SigmaPlot quite a bit for statistical work. But it is simply not
programmable so easily. SigmaPlot is fairly basic as far as statistics
go, but Python-Stats doesn't have all the features that SigmaPlot has.
I would like to see Python achieve atleast that level of functionality
as found in SigmaPlot. If that happens, then Python will be a serious
contender in the statistical analysis field. So one of the gaps I saw
was that after ANOVA there are no post hoc tests.

The Tukey-HSD test is simply to compare two groups together, much like
the t-test. But it is simply more conservative. It is a simple
comparison of means so sorting isn't needed from what I see so far.
After I was done with this I will write some code for the Bonferroni,
qtest, and other post tests.

The qcritical values table was from a website and I verified it with
the table found in "Biostatistical Analysis" by Zar(this is also the
same place that I got the equations for the Tukey test). I still don't
know how I can incorporate a table within Python. I can do
dictionaries, but how do you make a table that you can access using a
column and row number?

I know there are a few improvements, but not sure how to do them. I've
only been programming Python for two months(I'm also just as new to
statistics). I will make the changes Vincent pointed out, and I will
add an explanatory portion like you mentioned to this as well so we
can have some clear cut documentation.


Regards,

Vikas
> inhttp://www.itl.nist.gov/div898/handbook/prc/section4/prc47.htm
> > vinc...@vincentdavis.net
>
> > my blog | LinkedIn- Hide quoted text -
>
> - Show quoted text -- Hide quoted text -
>
> - Show quoted text -

josef...@gmail.com

unread,
May 25, 2010, 1:55:04 PM5/25/10
to pystat...@googlegroups.com
On Tue, May 25, 2010 at 12:38 PM, H Raja <vka...@gmail.com> wrote:
> Vincent-
>
> Actually I do have one question, which is in regards to the best way
> to do ANCOVA using Python. Right now I'm using R for it, but being
> able to do it with Python would enable me to integrate it into my
> program much better. I'm currently doing data analysis on microarrays
> as well. I will check out bioconductor, but so far I have been able to
> do everything I needed with just python. It is just so much easier to
> pull pieces of data from multiple files and piece them together in
> Python.
>
> I will incorporate the suggestions you made. I was racking by brains
> on how I can apply this function to ANOVAs with more than three sets
> of values. I'll come back if I have any problems on this.
>

ANOVA type analysis is currently still a lot of bits and pieces but no
systematic coverage.

there was an interesting thread last November and December on the nipy list
http://mail.scipy.org/pipermail/nipy-devel/2009-November/002257.html

maybe this might be also interesting to you
http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head:/scikits/statsmodels/sandbox/regression/onewaygls.py
http://bazaar.launchpad.net/~scipystats/statsmodels/trunk/annotate/head:/scikits/statsmodels/sandbox/examples/ex_onewaygls.py


>
>
> Josef-
>
> I should have explained this a bit more, sorry. I use the program
> SigmaPlot quite a bit for statistical work. But it is simply not
> programmable so easily. SigmaPlot is fairly basic as far as statistics
> go, but Python-Stats doesn't have all the features that SigmaPlot has.
> I would like to see Python achieve atleast that level of functionality
> as found in SigmaPlot. If that happens, then Python will be a serious
> contender in the statistical analysis field. So one of the gaps I saw
> was that after ANOVA there are no post hoc tests.

I hope we can close several gaps this year.

>
> The Tukey-HSD test is simply to compare two groups together, much like
> the t-test. But it is simply more conservative. It is a simple
> comparison of means so sorting isn't needed from what I see so far.

My impression from the quick reading of the Wikipedia page, was that
the test starts with the largest difference and works to the smaller
ones, which I thought you were doing. But, as I said, I didn't try to
understand all the details.

> After I was done with this I will write some code for the Bonferroni,
> qtest, and other post tests.
>
> The qcritical values table was from a website and I verified it with
> the table found in "Biostatistical Analysis" by Zar(this is also the
> same place that I got the equations for the Tukey test). I still don't
> know how I can incorporate a table within Python. I can do
> dictionaries, but how do you make a table that you can access using a
> column and row number?

I would save them from excel to a csv file and then load it into numpy
directly or convert it to an array in a module file.
If they are published values, we can include them without copyright
problems. (But maybe it's possible to generate them directly following
R or dataplot.)

>
> I know there are a few improvements, but not sure how to do them. I've
> only been programming Python for two months(I'm also just as new to
> statistics). I will make the changes Vincent pointed out, and I will
> add an explanatory portion like you mentioned to this as well so we
> can have some clear cut documentation.

Welcome aboard and sorry about the delay in approving your initial message.

Josef

Vincent Davis

unread,
May 25, 2010, 2:17:46 PM5/25/10
to pystat...@googlegroups.com
On Tue, May 25, 2010 at 10:38 AM, H Raja <vka...@gmail.com> wrote:
Vincent-

Actually I do have one question, which is in regards to the best way
to do ANCOVA using Python. Right now I'm using R for it, but being
able to do it with Python would enable me to integrate it into my
program much better. I'm currently doing data analysis on microarrays
as well. I will check out bioconductor, but so far I have been able to
do everything I needed with just python. It is just so much easier to
pull pieces of data from multiple files and piece them together in
Python.
 
I'll look into this a bit but Josef or Skipper might better be able to answer this.
You looked at biopython.org? they have good tools for reading fasta and other file formats as well as interacting with blast and other sequence search tool.
Most of what I have done has been comparing individual probes which is very difficult and not so fruitful. I agree that python is a good choice as the data is difficult to deal wuth and get into the format yuo need.
I do have a quantile-normalization written in python if you have need.


I will incorporate the suggestions you made. I was racking by brains
on how I can apply this function to ANOVAs with more than three sets
of values. I'll come back if I have any problems on this.

 
The qcritical values table was from a website and I verified it with

the table found in "Biostatistical Analysis" by Zar(this is also the
same place that I got the equations for the Tukey test). I still don't
know how I can incorporate a table within Python. I can do
dictionaries, but how do you make a table that you can access using a
column and row number?

You can just use numpy arrays for this but you might be interested in using larrt, http://larry.sourceforge.net/
 

I know there are a few improvements, but not sure how to do them. I've
only been programming Python for two months(I'm also just as new to
statistics). I will make the changes Vincent pointed out, and I will
add an explanatory portion like you mentioned to this as well so we
can have some clear cut documentation.

Is this a one time project use you have for this or are you trying to build something to share?

If you are reading in large files I found pickling a fast way to save and open the data.

H Raja

unread,
May 26, 2010, 1:16:20 PM5/26/10
to pystatsmodels
Josef-

Thank you for the idea on the csv file. I'll try that out, as well as
try and understand larry.

"My impression from the quick reading of the Wikipedia page, was that
the test starts with the largest difference and works to the smaller
ones, which I thought you were doing. But, as I said, I didn't try to
understand all the details. "

I will double check the wiki page, however I will use standard
equations that I find in "Biometry" by Sokal, and "Biostatistical
Analysis" by Zar. They are fairly standard biostat publications.


"Welcome aboard and sorry about the delay in approving your initial
message. "

Thank you much. I do hope to contribute as much as I can with my tight
schedule.


Vincent -


"Most of what I have done has been comparing individual probes which
is very
difficult and not so fruitful. I agree that python is a good choice as
the
data is difficult to deal wuth and get into the format yuo need.
I do have a quantile-normalization written in python if you have need.
"

That sounds great, I'm sure I will need that in near future. I have
been doing comparative genomics using microarray data. Basically data
dredging :). The only significant algorithm I wrote is the Tukey test.
The rest are just simple loops. However if you or anyone needs help
with how I piece files from different sets of data, I'd be glad to
help.

"Is this a one time project use you have for this or are you trying to
build
something to share?

If you are reading in large files I found pickling a fast way to save
and
open the data. "

A brief bit about myself. I just recently go my bachelors in genomics,
and am now going to start my quantitative biology. Most of the
research work I was doing was wet lab, but I got pulled off to do some
informatics projects. I am not sure myself how long I will be doing
this, but I will contribute to the best of my ability.

I should be having the Tukey test ready this weekend. I'll try and fix
it up so that it can take multiple samples, with different sample
lengths as well.

Regards,

Vikas

> larrt,http://larry.sourceforge.net/

> ...
>
> read more »- Hide quoted text -

josef...@gmail.com

unread,
Jul 4, 2010, 8:47:30 AM7/4/10
to pystat...@googlegroups.com

An additional application and reference from the irc chanel:

[17:08] <bdesk> Hi, I want to do a kruskal wallis test in python so
that i can compare the ranks of groups.
[17:08] <bdesk> I found this
http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#kruskal
[17:08] <bdesk> but it looks like it just returns a statistic and
pvalue that says whether all the groups are the same as each other.
[17:09] == sophacles
[~soph...@99-20-142-25.lightspeed.dctril.sbcglobal.net] has quit
[Remote host closed the connection]
[17:09] <bdesk> whereas I want more detail, as shown at the bottom of
the output here http://www.texasoft.com/winkkrus.html
[17:10] <bdesk> Can I use scipy to do this, or do I have to use something else?
[17:14] <josefpktd> bdesk: in your reference do you mean the different
ranksums for each group or the "Tukey Multiple Comp. Difference" ?
[17:15] <bdesk> I mean whichever part I would need to infer the little
ascii bars at the bottom
[17:18] <bdesk> josefpktd: so yes, i want the whole table that has the
"Tukey Multiple Comp. Difference" column as well as the Q and critical
q (.05).
[17:19] <josefpktd> bdesk: "graphical representation of the Tukey
multiple comparisons test", you could do the pairwise tests in a loop
of all pairs, but I think Tukey does a correction of the p-values for
multiple comparison


Josef

Reply all
Reply to author
Forward
0 new messages